Re: [OMPI users] SLURM and OpenMPI

2008-06-23 Thread Sacerdoti, Federico
Ralph, 

Thanks for your reply. Let me know if I can help in any way.

fds 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] SLURM and OpenMPI

Well, if the only system I cared about was slurm, there are some things I
could possibly do to make things better, but at the expense of our support
for other environments - which is unacceptable.

There are a few technical barriers to doing this without the orteds on
slurm, and a major licensing issue that prohibits us from calling any slurm
APIs. How all that gets resolved is unclear.

Frankly, one reason we don't put more emphasis on it is that we don't see a
significant launch time difference between the two modes, and we truly do
want to retain the ability to utilize different error response strategies
(which slurm will not allow - you can only follow theirs).

So I would say we simply have different objectives than what you stated, and
different concerns that make a deeper slurm integration less favorable. May
still happen, but not anytime soon.

Ralph



On 6/19/08 8:08 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph, thanks for your quick response.
> 
> Regarding your fourth paragraph, slurm will not let you run on a
> no-longer-valid allocation; an srun will correctly exit non-zero with a
> useful failure reason. So perhaps openmpi 1.3 with your changes will
> just work; I look forward to testing it.
> 
> E.g.
> $ srun hostname
> srun: error: Unable to confirm allocation for job 745346: Invalid job id
> specified
> srun: Check SLURM_JOBID environment variable for expired or invalid job.
> 
> 
> Regarding srun to launch the jobs directly (no orteds), I am sad to hear
> the idea is not in favor. We have found srun to be extremely scalable
> (tested up to 4096 MPI processes) and very good at cleaning up after an
> error or node failure. It seems you could simplify orterun quite a bit
> by relying on slurm (or whatever resource manager) to handle job
> cleanup after failures; it is their responsibility after all, and they
> have better knowledge about the health and availability of nodes than
> any launcher can hope for.
> 
> I helped write an mvapich launcher used internally called mvrun, which
> was used for several years. I wrote a lot of logic to run down and stop
> all processes when one had failed, which I understand you have as well.
> We came to the conclusion that slurm was in a better position to handle
> such failures, and in fact did it more effectively. For example, if slurm
> detects a node has failed, it will stop the job, allocate an additional
> free node to make up the deficit, then relaunch. It is more difficult (to
> put it mildly) for a job launcher to do that.
> 
> Thanks again,
> Federico
> 
> -Original Message-
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
> 
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2. Part
> of the problem was that we weren't using the --kill-on-bad-exit flag
> when we called srun internally, which has been fixed for 1.3.
> 
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we use
> orterun to setup the cmd line and then tell the daemons what they need
> to do. The issues you are seeing relate to our ability to detect that
> srun has failed, and/or that one or more daemons have failed to launch
> or do something they were supposed to do. The 1.2 system has problems in
> that regard, which was one motivation for the 1.3 overhaul.
> 
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we
> use srun to launch the daemons - the only reason we hang is that srun is
> not returning with an error. I've seen this on other systems as well,
> but have no real answer - if slurm doesn't indicate an error has
> occurred, I'm not sure what I can do about it.
> 
> We are unlikely to use srun to directly launch jobs (i.e., to have slurm
> directly launch the job from an srun cmd line without mpirun) anytime
> soon. It isn't clear there is enough benefit to justify the rather large

Re: [OMPI users] SLURM and OpenMPI

2008-06-19 Thread Ralph H Castain
Well, if the only system I cared about was slurm, there are some things I
could possibly do to make things better, but at the expense of our support
for other environments - which is unacceptable.

There are a few technical barriers to doing this without the orteds on
slurm, and a major licensing issue that prohibits us from calling any slurm
APIs. How all that gets resolved is unclear.

Frankly, one reason we don't put more emphasis on it is that we don't see a
significant launch time difference between the two modes, and we truly do
want to retain the ability to utilize different error response strategies
(which slurm will not allow - you can only follow theirs).

So I would say we simply have different objectives than what you stated, and
different concerns that make a deeper slurm integration less favorable. May
still happen, but not anytime soon.

Ralph



On 6/19/08 8:08 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph, thanks for your quick response.
> 
> Regarding your fourth paragraph, slurm will not let you run on a
> no-longer-valid allocation; an srun will correctly exit non-zero with a
> useful failure reason. So perhaps openmpi 1.3 with your changes will
> just work; I look forward to testing it.
> 
> E.g.
> $ srun hostname
> srun: error: Unable to confirm allocation for job 745346: Invalid job id
> specified
> srun: Check SLURM_JOBID environment variable for expired or invalid job.
> 
> 
> Regarding srun to launch the jobs directly (no orteds), I am sad to hear
> the idea is not in favor. We have found srun to be extremely scalable
> (tested up to 4096 MPI processes) and very good at cleaning up after an
> error or node failure. It seems you could simplify orterun quite a bit
> by relying on slurm (or whatever resource manager) to handle job
> cleanup after failures; it is their responsibility after all, and they
> have better knowledge about the health and availability of nodes than
> any launcher can hope for.
> 
> I helped write an mvapich launcher used internally called mvrun, which
> was used for several years. I wrote a lot of logic to run down and stop
> all processes when one had failed, which I understand you have as well.
> We came to the conclusion that slurm was in a better position to handle
> such failures, and in fact did it more effectively. For example, if slurm
> detects a node has failed, it will stop the job, allocate an additional
> free node to make up the deficit, then relaunch. It is more difficult (to
> put it mildly) for a job launcher to do that.
> 
> Thanks again,
> Federico
> 
> -Original Message-
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to
> do with slurm and reflects internal issues with 1.2.
> 
> We have made it much more resistant to those problems in the upcoming
> 1.3
> release, but there is no plan to retrofit those changes to 1.2. Part of
> the
> problem was that we weren't using the --kill-on-bad-exit flag when we
> called
> srun internally, which has been fixed for 1.3.
> 
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we use
> orterun to setup the cmd line and then tell the daemons what they need
> to
> do. The issues you are seeing relate to our ability to detect that srun
> has
> failed, and/or that one or more daemons have failed to launch or do
> something they were supposed to do. The 1.2 system has problems in that
> regard, which was one motivation for the 1.3 overhaul.
> 
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we
> use
> srun to launch the daemons - the only reason we hang is that srun is not
> returning with an error. I've seen this on other systems as well, but
> have
> no real answer - if slurm doesn't indicate an error has occurred, I'm
> not
> sure what I can do about it.
> 
> We are unlikely to use srun to directly launch jobs (i.e., to have slurm
> directly launch the job from an srun cmd line without mpirun) anytime
> soon.
> It isn't clear there is enough benefit to justify the rather large
> effort,
> especially considering what would be required to maintain scalability.
> Decisions on all that are still pending, though, which means any
> significant
> change in that regard wouldn't be released until sometime next year.
> 
> Ralph
> 
> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
> <

Re: [OMPI users] SLURM and OpenMPI

2008-06-17 Thread Ralph H Castain
I can believe 1.2.x has problems in that regard. Some of that has nothing to
do with slurm and reflects internal issues with 1.2.

We have made it much more resistant to those problems in the upcoming 1.3
release, but there is no plan to retrofit those changes to 1.2. Part of the
problem was that we weren't using the --kill-on-bad-exit flag when we called
srun internally, which has been fixed for 1.3.
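
For reference, on a plain srun command line the flag looks like this
(illustration only, not the internal call Open MPI makes):

  srun --kill-on-bad-exit -N 2 -n 8 ./helloworld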

BTW: we actually do use srun to launch the daemons - we just call it
internally from inside orterun. The only real difference is that we use
orterun to setup the cmd line and then tell the daemons what they need to
do. The issues you are seeing relate to our ability to detect that srun has
failed, and/or that one or more daemons have failed to launch or do
something they were supposed to do. The 1.2 system has problems in that
regard, which was one motivation for the 1.3 overhaul.

I would argue that slurm allowing us to attempt to launch on a
no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we use
srun to launch the daemons - the only reason we hang is that srun is not
returning with an error. I've seen this on other systems as well, but have
no real answer - if slurm doesn't indicate an error has occurred, I'm not
sure what I can do about it.

We are unlikely to use srun to directly launch jobs (i.e., to have slurm
directly launch the job from an srun cmd line without mpirun) anytime soon.
It isn't clear there is enough benefit to justify the rather large effort,
especially considering what would be required to maintain scalability.
Decisions on all that are still pending, though, which means any significant
change in that regard wouldn't be released until sometime next year.

Ralph

On 6/17/08 10:39 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph,
> 
> I was wondering what the status of this feature was (using srun to
> launch orted daemons)? I have two new bug reports to add from our
> experience using orterun from 1.2.6 on our 4000 CPU infiniband cluster.
> 
> 1. Orterun will happily hang if it is asked to run on an invalid slurm
> job, e.g. if the job has exceeded its timelimit. This would be trivially
> fixed if you used srun to launch, as they would fail with non-zero exit
> codes.
> 
> 2. A very simple orterun invocation hangs instead of exiting with an
> error. In this case the executable does not exist, and we would expect
> orterun to exit non-zero. This has caused
> headaches with some workflow management script that automatically start
> jobs.
> 
> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
> [hang]
> 
> orterun dummy-binary-I-dont-exist
> [hang]
> 
> Thanks,
> Federico
> 
> -Original Message-
> From: Sacerdoti, Federico
> Sent: Friday, March 21, 2008 5:41 PM
> To: 'Open MPI Users'
> Subject: RE: [OMPI users] SLURM and OpenMPI
> 
> 
> Ralph wrote:
> "I don't know if I would say we "interfere" with SLURM - I would say
> that we
> are only lightly integrated with SLURM at this time. We use SLURM as a
> resource manager to assign nodes, and then map processes onto those
> nodes
> according to the user's wishes. We chose to do this because srun applies
> its
> own load balancing algorithms if you launch processes directly with it,
> which leaves the user with little flexibility to specify their desired
> rank/slot mapping. We chose to support the greater flexibility."
>  
> Ralph, we wrote a launcher for mvapich that uses srun to launch but
> keeps tight control of where processes are started. The way we did it
> was to force srun to launch a single process on a particular node.
> 
> The launcher calls many of these:
>  srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
> 
> Hope this helps (and we are looking forward to a tighter orterun/slurm
> integration as you know).
> 
> Regards,
> Federico
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Thursday, March 20, 2008 6:41 PM
> To: Open MPI Users <us...@open-mpi.org>
> Cc: Ralph Castain
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> Hi there
> 
> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not
> the
> number of tasks to be executed on each node. So the 4(x2) tells us that
> we
> have 4 slots on each of two nodes to work with. You got 4 slots on each
> node
> because you used the -N option, which told slurm to assign all slots on
> that
> node to this job - I assume you have 4 processors on your nodes. OpenMPI
> parses that string to get the allocation, then maps the number of
> specified
> processes against it.
> 
&

Re: [OMPI users] SLURM and OpenMPI

2008-03-27 Thread Werner Augustin
On Thu, 20 Mar 2008 16:40:41 -0600
Ralph Castain  wrote:

> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
> not the number of tasks to be executed on each node. So the 4(x2)
> tells us that we have 4 slots on each of two nodes to work with. You
> got 4 slots on each node because you used the -N option, which told
> slurm to assign all slots on that node to this job - I assume you
> have 4 processors on your nodes. OpenMPI parses that string to get
> the allocation, then maps the number of specified processes against
> it.

That was also my interpretation, and I was absolutely sure I had read
it a couple of days ago in the srun man page. In the meantime I have
changed my opinion, because now it says "Number of tasks to be initiated
on each node", as Tim has quoted. I've no idea how Tim managed to change
the man page on my computer ;-)

and there is another variable documented: 

   SLURM_CPUS_ON_NODE
          Count of processors available to the job on this node. Note the
          select/linear plugin allocates entire nodes to jobs, so the value
          indicates the total count of CPUs on the node. The select/cons_res
          plugin allocates individual processors to jobs, so this number
          indicates the number of processors on this node allocated to the
          job.


Anyway, back to reality: I made some further tests, and the only way to
change the values of SLURM_TASKS_PER_NODE was to tell slurm that node x
has only y cpus in slurm.conf. The variable documented as
SLURM_CPUS_ON_NODE (in 1.0.15 and 1.2.22) doesn't seem to exist in
either version. In 1.2.22 there seems to be SLURM_JOB_CPUS_PER_NODE
which has the same value as SLURM_TASKS_PER_NODE. In a couple of days
I'll try the other allocator plugin, which allocates on a per-cpu basis
instead of a per-node basis. After that, it would probably be a good
idea for somebody (me?) to sum up our thread and ask the slurm guys
for their opinion.

> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
> different when used to allocate as opposed to directly launch
> processes. Our typical usage is for someone to do:
> 
> srun -N 2 -A
> mpirun -np 2 helloworld
> 
> In other words, we use srun to create an allocation, and then run
> mpirun separately within it.
> 
> 
> I am therefore unsure what the "-n 2" will do here. If I believe the
> documentation, it would seem to imply that srun will attempt to
> launch two copies of "mpirun -np 2 helloworld", yet your output
> doesn't seem to support that interpretation. It would appear that the
> "-n 2" is being ignored and only one copy of mpirun is being
> launched. I'm no slurm expert, so perhaps that interpretation is
> incorrect.

That indeed happens when you call "srun -N 2 mpirun -np 2 helloworld",
but "srun -N 2 -b mpirun -np 2 helloworld" submits it as a batch-job,
i.e. "mpirun -np 2 helloworld" is executed only once on one of the
allocated nodes, and environment variables are set appropriately -- or
at least should be set appropriately -- so that a subsequent srun or
an mpirun inside the command does the right thing.

Werner 


Re: [OMPI users] SLURM and OpenMPI

2008-03-21 Thread Sacerdoti, Federico

Ralph wrote:
"I don't know if I would say we "interfere" with SLURM - I would say
that we
are only lightly integrated with SLURM at this time. We use SLURM as a
resource manager to assign nodes, and then map processes onto those
nodes
according to the user's wishes. We chose to do this because srun applies
its
own load balancing algorithms if you launch processes directly with it,
which leaves the user with little flexibility to specify their desired
rank/slot mapping. We chose to support the greater flexibility."

Ralph, we wrote a launcher for mvapich that uses srun to launch but
keeps tight control of where processes are started. The way we did it
was to force srun to launch a single process on a particular node. 

The launcher calls many of these:
 srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
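
A stripped-down sketch of that pattern (hosts.txt, MY_RANK and a.out are
placeholders for illustration, not the actual mvrun code):

  # one single-task srun per rank, each pinned to its host
  rank=0
  while read host; do
    srun --jobid "$JOBID" -N 1 -n 1 -w "$host" env MY_RANK=$rank ./a.out &
    rank=$((rank + 1))
  done < hosts.txt
  wait   # reap all the per-rank sruns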

Hope this helps (and we are looking forward to a tighter orterun/slurm
integration as you know).

Regards,
Federico

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Thursday, March 20, 2008 6:41 PM
To: Open MPI Users <us...@open-mpi.org>
Cc: Ralph Castain
Subject: Re: [OMPI users] SLURM and OpenMPI

Hi there

I am no slurm expert. However, it is our understanding that
SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not
the
number of tasks to be executed on each node. So the 4(x2) tells us that
we
have 4 slots on each of two nodes to work with. You got 4 slots on each
node
because you used the -N option, which told slurm to assign all slots on
that
node to this job - I assume you have 4 processors on your nodes. OpenMPI
parses that string to get the allocation, then maps the number of
specified
processes against it.

It is possible that the interpretation of SLURM_TASKS_PER_NODE is
different
when used to allocate as opposed to directly launch processes. Our
typical
usage is for someone to do:

srun -N 2 -A
mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun
separately within it.


I am therefore unsure what the "-n 2" will do here. If I believe the
documentation, it would seem to imply that srun will attempt to launch
two
copies of "mpirun -np 2 helloworld", yet your output doesn't seem to
support
that interpretation. It would appear that the "-n 2" is being ignored
and
only one copy of mpirun is being launched. I'm no slurm expert, so
perhaps
that interpretation is incorrect.

Assuming that the -n 2 is ignored in this situation, your command line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped by slot against the
slurm
allocation of two nodes, each having 4 slots. Thus, both processes will
be
launched on the first node, which is what you observed.

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will launch a
process on every available slot in the allocation. Given this command,
that
means 4 procs will be launched on each of the 2 nodes, for a total of 8
procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the
second.
Again, this is what you observed.

I don't know if I would say we "interfere" with SLURM - I would say that
we
are only lightly integrated with SLURM at this time. We use SLURM as a
resource manager to assign nodes, and then map processes onto those
nodes
according to the user's wishes. We chose to do this because srun applies
its
own load balancing algorithms if you launch processes directly with it,
which leaves the user with little flexibility to specify their desired
rank/slot mapping. We chose to support the greater flexibility.

Using the SLURM-defined mapping will require launching without our
mpirun.
This capability is still under development, and there are issues with
doing
that in slurm environments which need to be addressed. It is at a lower
priority than providing such support for TM right now, so I wouldn't
expect
it to become available for several months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined
mapping and use it to launch the processes. If we can get it somehow,
there
is no problem launching it as specified - the problem is how to get the
map!
Unfortunately, slurm's licensing prevents us from using its internal
APIs,
so obtaining the map is not an easy thing to do.

Anyone who wants to help accelerate that timetable is welcome to contact
me.
We know the technical issues - this is mostly a problem of (a)
priorities
versus my available time, and (b) similar considerations on the part of
the
slurm folks to do the work themselves.

Ralph


On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote:

> Hi Werner,
> 
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=s

Re: [OMPI users] SLURM and OpenMPI

2008-03-20 Thread Ralph Castain
Hi there

I am no slurm expert. However, it is our understanding that
SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the
number of tasks to be executed on each node. So the 4(x2) tells us that we
have 4 slots on each of two nodes to work with. You got 4 slots on each node
because you used the -N option, which told slurm to assign all slots on that
node to this job - I assume you have 4 processors on your nodes. OpenMPI
parses that string to get the allocation, then maps the number of specified
processes against it.
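
To make that string format concrete, here is a rough shell sketch (not Open
MPI's actual parser) that expands SLURM_TASKS_PER_NODE into one slot count
per line, e.g. "4(x2),2" becomes 4, 4, 2:

  echo "$SLURM_TASKS_PER_NODE" | tr ',' '\n' | \
    sed -e 's/^\([0-9][0-9]*\)(x\([0-9][0-9]*\))$/\1 \2/' \
        -e 's/^\([0-9][0-9]*\)$/\1 1/' | \
    while read slots repeat; do
      for i in $(seq "$repeat"); do echo "$slots"; done
    done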

It is possible that the interpretation of SLURM_TASKS_PER_NODE is different
when used to allocate as opposed to directly launch processes. Our typical
usage is for someone to do:

srun -N 2 -A
mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun
separately within it.


I am therefore unsure what the "-n 2" will do here. If I believe the
documentation, it would seem to imply that srun will attempt to launch two
copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support
that interpretation. It would appear that the "-n 2" is being ignored and
only one copy of mpirun is being launched. I'm no slurm expert, so perhaps
that interpretation is incorrect.

Assuming that the -n 2 is ignored in this situation, your command line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped by slot against the slurm
allocation of two nodes, each having 4 slots. Thus, both processes will be
launched on the first node, which is what you observed.

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will launch a
process on every available slot in the allocation. Given this command, that
means 4 procs will be launched on each of the 2 nodes, for a total of 8
procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second.
Again, this is what you observed.

I don't know if I would say we "interfere" with SLURM - I would say that we
are only lightly integrated with SLURM at this time. We use SLURM as a
resource manager to assign nodes, and then map processes onto those nodes
according to the user's wishes. We chose to do this because srun applies its
own load balancing algorithms if you launch processes directly with it,
which leaves the user with little flexibility to specify their desired
rank/slot mapping. We chose to support the greater flexibility.

Using the SLURM-defined mapping will require launching without our mpirun.
This capability is still under development, and there are issues with doing
that in slurm environments which need to be addressed. It is at a lower
priority than providing such support for TM right now, so I wouldn't expect
it to become available for several months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined
mapping and use it to launch the processes. If we can get it somehow, there
is no problem launching it as specified - the problem is how to get the map!
Unfortunately, slurm's licensing prevents us from using its internal APIs,
so obtaining the map is not an easy thing to do.

Anyone who wants to help accelerate that timetable is welcome to contact me.
We know the technical issues - this is mostly a problem of (a) priorities
versus my available time, and (b) similar considerations on the part of the
slurm folks to do the work themselves.

Ralph


On 3/20/08 3:48 PM, "Tim Prins"  wrote:

> Hi Werner,
> 
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=slurm
> for general information about running with Open MPI on SLURM.
> 
> After trying the commands you sent, I am actually a bit surprised by the
> results. I would have expected this mode of operation to work. But
> looking at the environment variables that SLURM is setting for us, I can
> see why it doesn't.
> 
> On a cluster with 4 cores/node, I ran:
> [tprins@odin ~]$ cat mprun.sh
> #!/bin/sh
> printenv
> [tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
> srun: jobid 55641 submitted
> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
> SLURM_TASKS_PER_NODE=4(x2)
> [tprins@odin ~]$
> 
> Which seems to be wrong, since the srun man page says that
> SLURM_TASKS_PER_NODE is the "Number  of tasks to be initiated on each
> node". This seems to imply that the value should be "1(x2)". So maybe
> this is a SLURM problem? If this value were correctly reported, Open MPI
> should work fine for what you wanted to do.
> 
> Two other things:
> 1. You should probably use the command line option '--npernode' for
> mpirun instead of setting the rmaps_base_n_pernode directly.
> 2. In regards to your second example below, Open MPI by default maps 'by
> slot'. That is, it will fill all available slots on the first node
> before moving to the second. You can change this, see:
> 

Re: [OMPI users] SLURM and OpenMPI

2008-03-20 Thread Tim Prins

Hi Werner,

Open MPI does things a little bit differently than other MPIs when it 
comes to supporting SLURM. See

http://www.open-mpi.org/faq/?category=slurm
for general information about running with Open MPI on SLURM.

After trying the commands you sent, I am actually a bit surprised by the 
results. I would have expected this mode of operation to work. But 
looking at the environment variables that SLURM is setting for us, I can 
see why it doesn't.


On a cluster with 4 cores/node, I ran:
[tprins@odin ~]$ cat mprun.sh
#!/bin/sh
printenv
[tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
srun: jobid 55641 submitted
[tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
SLURM_TASKS_PER_NODE=4(x2)
[tprins@odin ~]$

Which seems to be wrong, since the srun man page says that 
SLURM_TASKS_PER_NODE is the "Number  of tasks to be initiated on each 
node". This seems to imply that the value should be "1(x2)". So maybe 
this is a SLURM problem? If this value were correctly reported, Open MPI 
should work fine for what you wanted to do.


Two other things:
1. You should probably use the command line option '--npernode' for 
mpirun instead of setting the rmaps_base_n_pernode directly.
2. In regards to your second example below, Open MPI by default maps 'by 
slot'. That is, it will fill all available slots on the first node 
before moving to the second. You can change this, see:

http://www.open-mpi.org/faq/?category=running#mpirun-scheduling

I have copied Ralph on this mail to see if he has a better response.

Tim

Werner Augustin wrote:

Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I got the following
problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 


Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 


srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and then starts tasks up
to the maximum number of slots on every node. So basically it also does
some 'resource management' and interferes with slurm. OK, I can fix that
with an mpirun wrapper script which calls mpirun with the right -np and
the right rmaps_base_n_pernode setting, but it gets worse. We want to
allocate computing power on a per-cpu basis instead of per node, i.e.
different users might share a node. In addition slurm allows scheduling
according to memory usage. Therefore it is important that on every node
there is exactly the number of tasks running that slurm wants. The only
solution I came up with is to generate for every job a detailed
hostfile and call mpirun --hostfile. Any suggestions for improvement?

I've found a discussion thread "slurm and all-srun orterun" in the
mailinglist archive concerning the same problem, where Ralph Castain
announced that he is working on two new launch methods which would fix
my problems. Unfortunately his email address is deleted from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin


[OMPI users] SLURM and OpenMPI

2008-03-20 Thread Werner Augustin
Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I got the following
problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 

srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and then starts tasks up
to the maximum number of slots on every node. So basically it also does
some 'resource management' and interferes with slurm. OK, I can fix that
with an mpirun wrapper script which calls mpirun with the right -np and
the right rmaps_base_n_pernode setting, but it gets worse. We want to
allocate computing power on a per-cpu basis instead of per node, i.e.
different users might share a node. In addition slurm allows scheduling
according to memory usage. Therefore it is important that on every node
there is exactly the number of tasks running that slurm wants. The only
solution I came up with is to generate for every job a detailed
hostfile and call mpirun --hostfile. Any suggestions for improvement?
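
A rough sketch of that workaround (illustrative only, run inside the
allocation; helloworld stands in for the real binary): srun starts one task
per granted slot, so counting hostnames recovers the per-node slot counts
slurm actually wants, and mpirun is then pinned to exactly that layout.

  srun hostname | sort | uniq -c | \
    awk '{ printf "%s slots=%s\n", $2, $1 }' > hostfile.$$
  mpirun --hostfile hostfile.$$ -np "$SLURM_NPROCS" ./helloworld
  rm -f hostfile.$$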

I've found a discussion thread "slurm and all-srun orterun" in the
mailinglist archive concerning the same problem, where Ralph Castain
announced that he is working on two new launch methods which would fix
my problems. Unfortunately his email address is deleted from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin


Re: [OMPI users] Slurm and Openmpi

2007-01-19 Thread Robert Bicknell

Thanks for the help. I renamed the nodes, and now slurm and openmpi seem
to be playing nicely with each other.

Bob

On 1/19/07, Jeff Squyres  wrote:


I think the SLURM code in Open MPI is making an assumption that is
failing in your case: we assume that your nodes will have a specific
naming convention:

mycluster.example.com --> head node
mycluster01.example.com --> cluster node 1
mycluster02.example.com --> cluster node 2
...etc.

OMPI is therefore parsing the SLURM environment and not correctly
> grokking the "master,wolf1" string because, to be honest, I didn't
even know that SLURM supported that scenario.  I.e., I thought SLURM
required the naming convention I listed above.  In hindsight, that's
a pretty silly assumption, but to be fair, you're the first user that
ever came to us with this problem (i.e., we use pretty much the same
string parsing in LAM/MPI, which has had SLURM support for several
years).  Oops!

We can fix this, but I don't know if it'll make the v1.2 cutoff or
not.  :-\

Thanks for bringing this to our attention!



On Jan 19, 2007, at 1:50 PM, Robert Bicknell wrote:

> Thanks for your response. The program that I have been using for
> testing purposes is a simple hello:
>
>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/resource.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>   char name[BUFSIZ];
>   int length;
>   int rank;
>   struct rlimit rlim;
>   FILE *output;
>
>   MPI_Init(&argc, &argv);
>   MPI_Get_processor_name(name, &length);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   rank = 0;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
> // while(1) {
>   printf("%s: hello world from rank %d\n", name, rank);
>   sleep(1);
> // }
>   MPI_Finalize();
>   return 0;
> }
>
> If I run this program not in a slurm environment I get the following
>
> mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello
>
> master: hello world from rank 1
> wolf1: hello world from rank 0
> wolf1: hello world from rank 2
> master: hello world from rank 3
>
> This is  exactly what I expect. Now if I create a slurm environment
> using the following:
>
> srun -n 4 -A
>
> The output of printenv | grep SLURM gives me:
>
> SLURM_NODELIST=master,wolf1
> SLURM_SRUN_COMM_PORT=58929
> SLURM_MEM_BIND_TYPE=
> SLURM_CPU_BIND_VERBOSE=quiet
> SLURM_MEM_BIND_LIST=
> SLURM_CPU_BIND_LIST=
> SLURM_NNODES=2
> SLURM_JOBID=66135
> SLURM_TASKS_PER_NODE=2(x2)
> SLURM_SRUN_COMM_HOST=master
> SLURM_CPU_BIND_TYPE=
> SLURM_MEM_BIND_VERBOSE=quiet
> SLURM_NPROCS=4
>
> This seems to indicate that both master and wolf1 have been
> allocated and that each node should run 2 tasks, which is correct
> since both master and wolf1 are dual processor machines.
>
> Now if I run:
>
> mpirun -np 4 -mca btl tcp,self ./hello
>
> The output is:
>
> master: hello world from rank 1
> master: hello world from rank 2
> master: hello world from rank 3
> master: hello world from rank 0
>
>
> All four processes are running on master and none on wolf1.
>
> If I try the following and specify the hosts. I get the following
> error message.
>
> mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello
>
> --
> 
> Some of the requested hosts are not included in the current
> allocation for the
> application:
>   ./hello
> The requested hosts were:
>   wolf1,master
>
> Verify that you have mapped the allocated resources properly using the
> --host specification.
> --
> 
> [master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> rmgr_urm.c at line 377
> [master:28022] mpirun: spawn failed with errno=-2
>
>
> I'm at a loss to figure out how to get this working correctly. Any
> help would be greatly appreciated.
>
> Bob
>
> On 1/19/07, Ralph Castain < r...@lanl.gov> wrote: Open MPI and SLURM
> should work together just fine right out-of-the-box. The
> typical command progression is:
>
> srun -n x -A
> mpirun -n y .
>
>
> If you are doing those commands and still see everything running on
> the head
> node, then two things could be happening:
>
> (a) you really aren't getting an allocation from slurm. Perhaps you
> don't
> have slurm setup correctly and aren't actually seeing the
> allocation in your
> environment. Do a "printenv | grep SLURM" and see if you find the
> following
> variables:
> SLURM_NPROCS=8
> SLURM_CPU_BIND_VERBOSE=quiet
> SLURM_CPU_BIND_TYPE=
> SLURM_CPU_BIND_LIST=
> SLURM_MEM_BIND_VERBOSE=quiet
> SLURM_MEM_BIND_TYPE=
> SLURM_MEM_BIND_LIST=
> SLURM_JOBID=47225
> SLURM_NNODES=2
> SLURM_NODELIST=odin[013-014]
> SLURM_TASKS_PER_NODE=4(x2)
> SLURM_SRUN_COMM_PORT=43206
> SLURM_SRUN_COMM_HOST=odin
>
> Obviously, the values will be different, but we really need the
> TASKS_PER_NODE and NODELIST ones to be there
>
> (b) the master node is being included in your nodelist and you aren't
> running enough mpi processes to need more nodes (i.e., the number
> of slots
> on the master node is greater than 

Re: [OMPI users] Slurm and Openmpi

2007-01-19 Thread Jeff Squyres
I think the SLURM code in Open MPI is making an assumption that is  
failing in your case: we assume that your nodes will have a specific  
naming convention:


mycluster.example.com --> head node
mycluster01.example.com --> cluster node 1
mycluster02.example.com --> cluster node 2
...etc.

OMPI is therefore parsing the SLURM environment and not correctly  
grokking the "master,wolf1" string because, to be honest, I didn't
even know that SLURM supported that scenario.  I.e., I thought SLURM  
required the naming convention I listed above.  In hindsight, that's  
a pretty silly assumption, but to be fair, you're the first user that  
ever came to us with this problem (i.e., we use pretty much the same  
string parsing in LAM/MPI, which has had SLURM support for several  
years).  Oops!


We can fix this, but I don't know if it'll make the v1.2 cutoff or  
not.  :-\


Thanks for bringing this to our attention!



On Jan 19, 2007, at 1:50 PM, Robert Bicknell wrote:

Thanks for your response. The program that I have been using for  
testing purposes is a simple hello:



#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  char name[BUFSIZ];
  int length;
  int rank;
  struct rlimit rlim;
  FILE *output;

  MPI_Init(&argc, &argv);
  MPI_Get_processor_name(name, &length);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// while(1) {
  printf("%s: hello world from rank %d\n", name, rank);
  sleep(1);
// }
  MPI_Finalize();
  return 0;
}

If I run this program not in a slurm environment I get the following

mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello

master: hello world from rank 1
wolf1: hello world from rank 0
wolf1: hello world from rank 2
master: hello world from rank 3

This is  exactly what I expect. Now if I create a slurm environment  
using the following:


srun -n 4 -A

The output of printenv | grep SLURM gives me:

SLURM_NODELIST=master,wolf1
SLURM_SRUN_COMM_PORT=58929
SLURM_MEM_BIND_TYPE=
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_MEM_BIND_LIST=
SLURM_CPU_BIND_LIST=
SLURM_NNODES=2
SLURM_JOBID=66135
SLURM_TASKS_PER_NODE=2(x2)
SLURM_SRUN_COMM_HOST=master
SLURM_CPU_BIND_TYPE=
SLURM_MEM_BIND_VERBOSE=quiet
SLURM_NPROCS=4

This seems to indicate that both master and wolf1 have been  
allocated and that each node should run 2 tasks, which is correct  
since both master and wolf1 are dual processor machines.


Now if I run:

mpirun -np 4 -mca btl tcp,self ./hello

The output is:

master: hello world from rank 1
master: hello world from rank 2
master: hello world from rank 3
master: hello world from rank 0


All four processes are running on master and none on wolf1.

If I try the following and specify the hosts. I get the following  
error message.


mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello

-- 

Some of the requested hosts are not included in the current  
allocation for the

application:
  ./hello
The requested hosts were:
  wolf1,master

Verify that you have mapped the allocated resources properly using the
--host specification.
-- 

[master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file  
rmgr_urm.c at line 377

[master:28022] mpirun: spawn failed with errno=-2


I'm at a loss to figure out how to get this working correctly. Any  
help would be greatly appreciated.


Bob

On 1/19/07, Ralph Castain  wrote: Open MPI and SLURM  
should work together just fine right out-of-the-box. The

typical command progression is:

srun -n x -A
mpirun -n y .


If you are doing those commands and still see everything running on  
the head

node, then two things could be happening:

(a) you really aren't getting an allocation from slurm. Perhaps you  
don't
have slurm setup correctly and aren't actually seeing the  
allocation in your
environment. Do a "printenv | grep SLURM" and see if you find the  
following

variables:
SLURM_NPROCS=8
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=
SLURM_CPU_BIND_LIST=
SLURM_MEM_BIND_VERBOSE=quiet
SLURM_MEM_BIND_TYPE=
SLURM_MEM_BIND_LIST=
SLURM_JOBID=47225
SLURM_NNODES=2
SLURM_NODELIST=odin[013-014]
SLURM_TASKS_PER_NODE=4(x2)
SLURM_SRUN_COMM_PORT=43206
SLURM_SRUN_COMM_HOST=odin

Obviously, the values will be different, but we really need the
TASKS_PER_NODE and NODELIST ones to be there

(b) the master node is being included in your nodelist and you aren't
running enough mpi processes to need more nodes (i.e., the number  
of slots
on the master node is greater than or equal to the num procs you  
requested).

You can force Open MPI to not run on your master node by including
"--nolocal" on your command line.

Of course, if the master node is the only thing on the nodelist,  
this will

cause mpirun to abort as there is nothing else for us to use.

Hope that helps
Ralph


On 1/18/07 11:03 PM, "Robert Bicknell"  wrote:

> I'm 

Re: [OMPI users] Slurm and Openmpi

2007-01-19 Thread Ralph Castain
Open MPI and SLURM should work together just fine right out-of-the-box. The
typical command progression is:

srun -n x -A
mpirun -n y .


If you are doing those commands and still see everything running on the head
node, then two things could be happening:

(a) you really aren't getting an allocation from slurm. Perhaps you don't
have slurm setup correctly and aren't actually seeing the allocation in your
environment. Do a "printenv | grep SLURM" and see if you find the following
variables:
SLURM_NPROCS=8
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=
SLURM_CPU_BIND_LIST=
SLURM_MEM_BIND_VERBOSE=quiet
SLURM_MEM_BIND_TYPE=
SLURM_MEM_BIND_LIST=
SLURM_JOBID=47225
SLURM_NNODES=2
SLURM_NODELIST=odin[013-014]
SLURM_TASKS_PER_NODE=4(x2)
SLURM_SRUN_COMM_PORT=43206
SLURM_SRUN_COMM_HOST=odin

Obviously, the values will be different, but we really need the
TASKS_PER_NODE and NODELIST ones to be there

(b) the master node is being included in your nodelist and you aren't
running enough mpi processes to need more nodes (i.e., the number of slots
on the master node is greater than or equal to the num procs you requested).
You can force Open MPI to not run on your master node by including
"--nolocal" on your command line.

Of course, if the master node is the only thing on the nodelist, this will
cause mpirun to abort as there is nothing else for us to use.

Hope that helps
Ralph


On 1/18/07 11:03 PM, "Robert Bicknell"  wrote:

> I'm trying to get slurm and openmpi to work together on a Debian two-node
> cluster. Slurm and openmpi seem to work fine separately, but when I try to
> run an MPI program in a slurm allocation, all the processes get run on the
> master node and are not distributed to the second node. What am I doing
> wrong?
> 
> Bob




[OMPI users] Slurm and Openmpi

2007-01-19 Thread Robert Bicknell
I'm trying to get slurm and openmpi to work together on a Debian two-node
cluster. Slurm and openmpi seem to work fine separately, but when I try to
run an MPI program in a slurm allocation, all the processes get run on the
master node and are not distributed to the second node. What am I doing
wrong?


Bob