Re: [OMPI users] More on AlltoAll

2008-03-20 Thread Terry Frankcombe
If the data distribution was sufficiently predictable and long-lived
through the life of the application, could one not define new
communicators to clean up the calls?
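
A minimal sketch of that idea, assuming the grouping of processes that
actually exchange data can be expressed as a colour (the colour rule below
is just a placeholder, not anything from the original code):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm neigh_comm;
    int world_rank, color;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* placeholder rule: ranks that exchange data with one another must
       all compute the same color (e.g. from the domain decomposition)  */
    color = world_rank / 4;

    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &neigh_comm);

    /* ... do the AlltoAll/AlltoAllv over neigh_comm instead of
       MPI_COMM_WORLD, so uninvolved processes never participate ...    */

    MPI_Comm_free(&neigh_comm);
    MPI_Finalize();
    return 0;
}

Whether this pays off depends on how long the pattern lives, since
MPI_Comm_split is itself a collective over the parent communicator.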


> After reading the previous discussion on AllReduce and AlltoAll, I
> thought I would ask my question. I have a case where I have data
> unevenly distributed among the processes (unevenly means that the
> processes have differing amounts of data) that I need to globally
> redistribute, resulting in a different uneven distribution. Writing the
> code to do the redistribution using AlltoAll is straightforward.
>
> The problem though is that there are often special cases where each
> process only needs to exchange data with its neighbors. So the question
> is: when two processes don't have data to exchange, is the OpenMPI
> AlltoAll written in such a way that they don't do any communication?
> Will the AlltoAll be as efficient (or at least nearly as efficient) as
> direct send/recv among neighbors?
>   Thanks!
> Dave




Re: [OMPI users] SLURM and OpenMPI

2008-03-20 Thread Ralph Castain
Hi there

I am no slurm expert. However, it is our understanding that
SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the
number of tasks to be executed on each node. So the 4(x2) tells us that we
have 4 slots on each of two nodes to work with. You got 4 slots on each node
because you used the -N option, which told slurm to assign all slots on that
node to this job - I assume you have 4 processors on your nodes. OpenMPI
parses that string to get the allocation, then maps the number of specified
processes against it.
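
As a rough illustration (this is not Open MPI's actual parsing code), a
value such as "4(x2)" or "4(x2),3" can be expanded into one slot count per
node along these lines:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rough sketch only: expand SLURM_TASKS_PER_NODE, e.g. "4(x2),3",
   into one slot count per node (here: 4 4 3).                      */
int main(void)
{
    const char *env = getenv("SLURM_TASKS_PER_NODE");
    char buf[256];
    char *tok;

    if (env == NULL)
        return 1;
    strncpy(buf, env, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        int slots = 0, repeat = 1;
        sscanf(tok, "%d(x%d)", &slots, &repeat);  /* "(xN)" part is optional */
        while (repeat-- > 0)
            printf("%d\n", slots);                /* one entry per node */
    }
    return 0;
}

Each printed entry corresponds to one node in SLURM_NODELIST.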

It is possible that the interpretation of SLURM_TASKS_PER_NODE is different
when used to allocate as opposed to directly launch processes. Our typical
usage is for someone to do:

srun -N 2 -A
mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun
separately within it.


I am therefore unsure what the "-n 2" will do here. If I believe the
documentation, it would seem to imply that srun will attempt to launch two
copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support
that interpretation. It would appear that the "-n 2" is being ignored and
only one copy of mpirun is being launched. I'm no slurm expert, so perhaps
that interpretation is incorrect.

Assuming that the -n 2 is ignored in this situation, your command line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped by slot against the slurm
allocation of two nodes, each having 4 slots. Thus, both processes will be
launched on the first node, which is what you observed.

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will launch a
process on every available slot in the allocation. Given this command, that
means 4 procs will be launched on each of the 2 nodes, for a total of 8
procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second.
Again, this is what you observed.

I don't know if I would say we "interfere" with SLURM - I would say that we
are only lightly integrated with SLURM at this time. We use SLURM as a
resource manager to assign nodes, and then map processes onto those nodes
according to the user's wishes. We chose to do this because srun applies its
own load balancing algorithms if you launch processes directly with it,
which leaves the user with little flexibility to specify their desired
rank/slot mapping. We chose to support the greater flexibility.

Using the SLURM-defined mapping will require launching without our mpirun.
This capability is still under development, and there are issues with doing
that in slurm environments which need to be addressed. It is at a lower
priority than providing such support for TM right now, so I wouldn't expect
it to become available for several months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined
mapping and use it to launch the processes. If we can get it somehow, there
is no problem launching it as specified - the problem is how to get the map!
Unfortunately, slurm's licensing prevents us from using its internal APIs,
so obtaining the map is not an easy thing to do.

Anyone who wants to help accelerate that timetable is welcome to contact me.
We know the technical issues - this is mostly a problem of (a) priorities
versus my available time, and (b) similar considerations on the part of the
slurm folks to do the work themselves.

Ralph


On 3/20/08 3:48 PM, "Tim Prins"  wrote:

> Hi Werner,
> 
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=slurm
> for general information about running with Open MPI on SLURM.
> 
> After trying the commands you sent, I am actually a bit surprised by the
> results. I would have expected this mode of operation to work. But
> looking at the environment variables that SLURM is setting for us, I can
> see why it doesn't.
> 
> On a cluster with 4 cores/node, I ran:
> [tprins@odin ~]$ cat mprun.sh
> #!/bin/sh
> printenv
> [tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
> srun: jobid 55641 submitted
> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
> SLURM_TASKS_PER_NODE=4(x2)
> [tprins@odin ~]$
> 
> Which seems to be wrong, since the srun man page says that
> SLURM_TASKS_PER_NODE is the "Number  of tasks to be initiated on each
> node". This seems to imply that the value should be "1(x2)". So maybe
> this is a SLURM problem? If this value were correctly reported, Open MPI
> should work fine for what you wanted to do.
> 
> Two other things:
> 1. You should probably use the command line option '--npernode' for
> mpirun instead of setting the rmaps_base_n_pernode directly.
> 2. In regards to your second example below, Open MPI by default maps 'by
> slot'. That is, it will fill all available slots on the first node
> before moving to the second. You can change this, see:
> 

Re: [OMPI users] SLURM and OpenMPI

2008-03-20 Thread Tim Prins

Hi Werner,

Open MPI does things a little bit differently than other MPIs when it 
comes to supporting SLURM. See

http://www.open-mpi.org/faq/?category=slurm
for general information about running with Open MPI on SLURM.

After trying the commands you sent, I am actually a bit surprised by the 
results. I would have expected this mode of operation to work. But 
looking at the environment variables that SLURM is setting for us, I can 
see why it doesn't.


On a cluster with 4 cores/node, I ran:
[tprins@odin ~]$ cat mprun.sh
#!/bin/sh
printenv
[tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
srun: jobid 55641 submitted
[tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
SLURM_TASKS_PER_NODE=4(x2)
[tprins@odin ~]$

Which seems to be wrong, since the srun man page says that 
SLURM_TASKS_PER_NODE is the "Number  of tasks to be initiated on each 
node". This seems to imply that the value should be "1(x2)". So maybe 
this is a SLURM problem? If this value were correctly reported, Open MPI 
should work fine for what you wanted to do.


Two other things:
1. You should probably use the command line option '--npernode' for 
mpirun instead of setting the rmaps_base_n_pernode directly.
2. In regards to your second example below, Open MPI by default maps 'by 
slot'. That is, it will fill all available slots on the first node 
before moving to the second. You can change this, see:

http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
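
For example (hypothetical invocations, using the 1.2-series mpirun options
mentioned above):

mpirun -np 2 --npernode 1 helloworld
mpirun -np 2 --bynode helloworld

The first caps the job at one process per node; the second launches two
processes but places them round-robin across the nodes instead of filling
the first node's slots before moving on.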

I have copied Ralph on this mail to see if he has a better response.

Tim

Werner Augustin wrote:

Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I got the following
problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 


Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 


srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and then launches
tasks up to the maximum number of slots on every node. So basically it
also does some 'resource management' and interferes with slurm. OK, I
can fix that with an mpirun wrapper script which calls mpirun with the
right -np and the right rmaps_base_n_pernode setting, but it gets worse.
We want to allocate computing power on a per-CPU basis instead of per
node, i.e. different users might share a node. In addition, slurm allows
scheduling according to memory usage. Therefore it is important that
every node runs exactly the number of tasks that slurm wants. The only
solution I came up with is to generate a detailed hostfile for every
job and call mpirun --hostfile. Any suggestions for improvement?

I've found a discussion thread "slurm and all-srun orterun" in the
mailinglist archive concerning the same problem, where Ralph Castain
announced that he is working on two new launch methods which would fix
my problems. Unfortunately his email address is deleted from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin


Re: [OMPI users] More on AlltoAll

2008-03-20 Thread Dave Grote





Sorry - my mistake - I meant AlltoAllV, which is what I use in my code.

Ashley Pittman wrote:

  On Thu, 2008-03-20 at 10:27 -0700, Dave Grote wrote:
  
  
After reading the previous discussion on AllReduce and AlltoAll, I 
thought I would ask my question. I have a case where I have data 
unevenly distributed among the processes (unevenly means that the 
processes have differing amounts of data) that I need to globally 
redistribute, resulting in a different uneven distribution. Writing the 
code to do the redistribution using AlltoAll is straightforward.

The problem though is that there are often special cases where each
process only needs to exchange data with its neighbors. So the question
is: when two processes don't have data to exchange, is the OpenMPI
AlltoAll written in such a way that they don't do any communication?
Will the AlltoAll be as efficient (or at least nearly as efficient) as
direct send/recv among neighbors?

  
  
AlltoAll takes a single message size and communicates that amount of
data from everybody to everybody.  You might want to look at AlltoAllw
and AlltoAllv, though I have no experience with either.

Ashley,


  





Re: [OMPI users] More on AlltoAll

2008-03-20 Thread Ashley Pittman

On Thu, 2008-03-20 at 10:27 -0700, Dave Grote wrote:
> After reading the previous discussion on AllReduce and AlltoAll, I 
> thought I would ask my question. I have a case where I have data 
> unevenly distributed among the processes (unevenly means that the 
> processes have differing amounts of data) that I need to globally 
> redistribute, resulting in a different uneven distribution. Writing the 
> code to do the redistribution using AlltoAll is straightforward.
> 
> The problem though is that there are often special cases where each
> process only needs to exchange data with its neighbors. So the question
> is: when two processes don't have data to exchange, is the OpenMPI
> AlltoAll written in such a way that they don't do any communication?
> Will the AlltoAll be as efficient (or at least nearly as efficient) as
> direct send/recv among neighbors?

AlltoAll takes a single message size and communicates that amount of
data from everybody to everybody.  You might want to look at AlltoAllw
and AlltoAllv, though I have no experience with either.
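
For what it's worth, a minimal MPI_Alltoallv sketch of the kind of sparse
exchange Dave describes, with zero counts for every rank except the
immediate left/right neighbours (the counts and buffers are placeholders,
not his actual layout):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int size, rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sendcounts = calloc(size, sizeof(int));
    int *recvcounts = calloc(size, sizeof(int));
    int *sdispls    = calloc(size, sizeof(int));
    int *rdispls    = calloc(size, sizeof(int));

    /* placeholder pattern: one int to/from each immediate neighbour only */
    if (rank > 0)        { sendcounts[rank - 1] = 1; recvcounts[rank - 1] = 1; }
    if (rank < size - 1) { sendcounts[rank + 1] = 1; recvcounts[rank + 1] = 1; }

    /* displacements follow directly from the (mostly zero) counts */
    for (i = 1; i < size; i++) {
        sdispls[i] = sdispls[i - 1] + sendcounts[i - 1];
        rdispls[i] = rdispls[i - 1] + recvcounts[i - 1];
    }

    int sendbuf[2] = { rank, rank }, recvbuf[2] = { -1, -1 };
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}

Whether the ranks with zero counts really skip all communication is still
down to the implementation's algorithm, which is exactly the question above.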

Ashley,



[OMPI users] More on AlltoAll

2008-03-20 Thread Dave Grote


After reading the previous discussion on AllReduce and AlltoAll, I 
thought I would ask my question. I have a case where I have data 
unevenly distributed among the processes (unevenly means that the 
processes have differing amounts of data) that I need to globally 
redistribute, resulting in a different uneven distribution. Writing the 
code to do the redistribution using AlltoAll is straightforward.


The problem though is that there are often special cases where each
process only needs to exchange data with its neighbors. So the question
is: when two processes don't have data to exchange, is the OpenMPI
AlltoAll written in such a way that they don't do any communication?
Will the AlltoAll be as efficient (or at least nearly as efficient) as
direct send/recv among neighbors?

 Thanks!
   Dave


[OMPI users] Unexpected compile error setting FILE_NULL Errhandler using C++ Bindings

2008-03-20 Thread Eidson, Eric D
Hello,

OpenMPI 1.2.5 and earlier do not let you set the Errhandler for
MPI::FILE_NULL using the C++ bindings.

[You would want to do so because, on error, MPI::File::Open() and
MPI::File::Delete() call the Errhandler associated with FILE_NULL.]

With the C++ bindings, MPI::FILE_NULL is a const object, and Set_errhandler
is apparently a non-const function -- so compiling fails.

Eric

--

#include <mpi.h>

int
main()
{
  MPI::Init();

  MPI::Errhandler oldeh = MPI::FILE_NULL.Get_errhandler();
  // fails to compile: FILE_NULL is const, but Set_errhandler is non-const
  MPI::FILE_NULL.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
  MPI::FILE_NULL.Set_errhandler(oldeh);

  MPI::Finalize();
}
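
One possible workaround (an untested sketch, not something from this
report) is to fall back to the C bindings for these calls, where
MPI_FILE_NULL is not const. Note the C bindings have no equivalent of
ERRORS_THROW_EXCEPTIONS, so this uses MPI_ERRORS_RETURN instead:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Errhandler oldeh;

    MPI_Init(&argc, &argv);

    /* save the current default file error handler, then replace it */
    MPI_File_get_errhandler(MPI_FILE_NULL, &oldeh);
    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);

    /* ... MPI_File_open()/MPI_File_delete() failures now return codes ... */

    /* restore the previous handler */
    MPI_File_set_errhandler(MPI_FILE_NULL, oldeh);
    MPI_Errhandler_free(&oldeh);

    MPI_Finalize();
    return 0;
}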





[OMPI users] SLURM and OpenMPI

2008-03-20 Thread Werner Augustin
Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I got the following
problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 

srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and then launches
tasks up to the maximum number of slots on every node. So basically it
also does some 'resource management' and interferes with slurm. OK, I
can fix that with an mpirun wrapper script which calls mpirun with the
right -np and the right rmaps_base_n_pernode setting, but it gets worse.
We want to allocate computing power on a per-CPU basis instead of per
node, i.e. different users might share a node. In addition, slurm allows
scheduling according to memory usage. Therefore it is important that
every node runs exactly the number of tasks that slurm wants. The only
solution I came up with is to generate a detailed hostfile for every
job and call mpirun --hostfile. Any suggestions for improvement?
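
For example, the detailed hostfile generated for the two-node,
one-task-per-node case above might look something like

xc3n13 slots=1
xc3n14 slots=1

which is then passed to mpirun via --hostfile together with the matching
-np.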

I've found a discussion thread "slurm and all-srun orterun" in the
mailinglist archive concerning the same problem, where Ralph Castain
announced that he is working on two new launch methods which would fix
my problems. Unfortunately his email address is deleted from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin