Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-11-07 Thread Dave Love
"r...@open-mpi.org"  writes:

> Yes, I’ve been hearing a growing number of complaints about cgroups for that 
> reason. Our mapping/ranking/binding options will work with the cgroup 
> envelope, but it generally winds up with a result that isn’t what the user 
> wanted or expected.

How?  I don't understand as an implementor why there's a difference from
just resource manager core binding, assuming the programs don't try to
escape the binding.  (I'm not saying there's nothing wrong with cgroups
in general...)

> We always post the OMPI BoF slides on our web site, and we’ll do the same 
> this year. I may try to record webcast on it and post that as well since I 
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man 
> mpirun”:

I'm afraid I have read that, and various versions of the code at different
times, and I've worked on resource manager core binding.  I still had to
experiment to find a way to run MPI+OpenMP jobs correctly, in multiple
OMPI versions.  NEWS usually doesn't help, nor do conference talks for
people who weren't there and don't know they should search beyond the
documentation.  We don't even seem to be able to file reliable bug
reports, as they may or may not get picked up here.

Regardless, I can't see how binding to socket can be a good default.
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-29 Thread Bennet Fauber
Thanks, Ralph,

A video would be great to accompany the slides!

I hope you have a good and productive SC16.

-- bennet



On Fri, Oct 28, 2016 at 8:40 PM, r...@open-mpi.org  wrote:
> Yes, I’ve been hearing a growing number of complaints about cgroups for that 
> reason. Our mapping/ranking/binding options will work with the cgroup 
> envelope, but it generally winds up with a result that isn’t what the user 
> wanted or expected.
>
> We always post the OMPI BoF slides on our web site, and we’ll do the same 
> this year. I may try to record webcast on it and post that as well since I 
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man 
> mpirun”:
>
> [man page excerpt trimmed; it appears in full in the original message below]

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-28 Thread r...@open-mpi.org
Yes, I’ve been hearing a growing number of complaints about cgroups for that 
reason. Our mapping/ranking/binding options will work with the cgroup envelope, 
but it generally winds up with a result that isn’t what the user wanted or 
expected.

We always post the OMPI BoF slides on our web site, and we’ll do the same this 
year. I may try to record webcast on it and post that as well since I know it 
can be confusing given all the flexibility we expose.

In case you haven’t read it yet, here is the relevant section from “man mpirun”:

 Mapping, Ranking, and Binding: Oh My!
   Open MPI employs a three-phase procedure for assigning process locations 
and ranks:

   mapping   Assigns a default location to each process

   ranking   Assigns an MPI_COMM_WORLD rank value to each process

   binding   Constrains each process to run on specific processors

   The mapping step is used to assign a default location to each process
based on the mapper being employed. Mapping by slot, by node, and sequentially
results in the assignment of the processes at the node level. In contrast,
mapping by object allows the mapper to assign the process to an actual object
on each node.

   Note: the location assigned to the process is independent of where it 
will be bound - the assignment is used solely as input to the binding algorithm.

   The mapping of processes to nodes can be defined not just with
general policies but also, if necessary, using arbitrary mappings that cannot
be described by a simple policy.  One can use the "sequential mapper," which
reads the hostfile line by line, assigning processes to nodes in whatever
order the hostfile specifies.  Use the -mca rmaps seq option.  For example,
using the same hostfile as before:

   mpirun -hostfile myhostfile -mca rmaps seq ./a.out

   will launch three processes, one on each of nodes aa, bb, and cc,
respectively.  The slot counts don't matter; one process is launched per line
on whatever node is listed on the line.
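   The hostfile itself is elided from this excerpt; a hypothetical
myhostfile consistent with the description (node names aa, bb, and cc come
from the text; the slots values are illustrative) might look like:

```
aa slots=2
bb slots=2
cc slots=2
```

   Under -mca rmaps seq the slot counts are ignored: one process is
launched per line, on the node that line names.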

   Another way to specify arbitrary mappings is with a rankfile, which 
gives you detailed control over process binding as well.  Rankfiles are 
discussed below.

   The second phase focuses on the ranking of the processes within the
job's MPI_COMM_WORLD.  Open MPI separates this from the mapping procedure to
allow more flexibility in the relative placement of MPI processes. This is
best illustrated by considering the following two cases where we used the
--map-by ppr:2:socket option:

                            node aa        node bb

   rank-by core             0 1 ! 2 3      4 5 ! 6 7

   rank-by socket           0 2 ! 1 3      4 6 ! 5 7

   rank-by socket:span      0 4 ! 1 5      2 6 ! 3 7

   Ranking by core and by slot provide the identical result - a simple
progression of MPI_COMM_WORLD ranks across each node. Ranking by socket does
a round-robin ranking within each node until all processes have been assigned
an MCW rank, and then progresses to the next node. Adding the span modifier
to the ranking directive causes the ranking algorithm to treat the entire
allocation as a single entity - thus, the MCW ranks are assigned across all
sockets before circling back around to the beginning.
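The three ranking policies can be sketched in a few lines. The following is an illustrative simulation (not Open MPI source code) that reproduces the table above for a layout produced by --map-by ppr:2:socket on two nodes with two sockets each:

```python
def rank_slots(layout, policy):
    """Assign MCW ranks to mapped slots.

    layout: list of nodes; each node is a list of sockets; each socket is
    the number of processes the mapper placed there.  Returns the rank
    assigned to each slot, grouped by node and socket.
    """
    ranks = [[[None] * n for n in node] for node in layout]
    rank = 0
    if policy == "core":  # simple progression of ranks across each node
        for node in ranks:
            for sock in node:
                for slot in range(len(sock)):
                    sock[slot] = rank
                    rank += 1
    elif policy == "socket":  # round-robin over sockets, one node at a time
        for node in ranks:
            for slot in range(max(len(s) for s in node)):
                for sock in node:
                    if slot < len(sock):
                        sock[slot] = rank
                        rank += 1
    elif policy == "socket:span":  # round-robin over all sockets in the job
        socks = [s for node in ranks for s in node]
        for slot in range(max(len(s) for s in socks)):
            for sock in socks:
                if slot < len(sock):
                    sock[slot] = rank
                    rank += 1
    return ranks

layout = [[2, 2], [2, 2]]  # --map-by ppr:2:socket, 2 nodes x 2 sockets
print(rank_slots(layout, "core"))         # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
print(rank_slots(layout, "socket"))       # [[[0, 2], [1, 3]], [[4, 6], [5, 7]]]
print(rank_slots(layout, "socket:span"))  # [[[0, 4], [1, 5]], [[2, 6], [3, 7]]]
```

The three printed layouts match the three rows of the table, node aa first, node bb second, with each inner list one socket.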

   The binding phase actually binds each process to a given set of
processors. This can improve performance if the operating system is placing
processes suboptimally.  For example, it might oversubscribe some multi-core
processor sockets, leaving other sockets idle; this can lead processes to
contend unnecessarily for common resources.  Or, it might spread processes
out too widely; this can be suboptimal if application performance is
sensitive to interprocess communication costs.  Binding can also keep the
operating system from migrating processes excessively, regardless of how
optimally those processes were placed to begin with.

   The processors to be used for binding can be identified in terms of
topological groupings - e.g., binding to an l3cache will bind each process to
all processors within the scope of a single L3 cache within their assigned
location.  Thus, if a process is assigned by the mapper to a certain socket,
then a --bind-to l3cache directive will cause the process to be bound to the
processors that share a single L3 cache within that socket.

   To help balance loads, the binding directive uses a round-robin method
when binding to levels lower than used in the mapper. For example, consider
the case where a job is mapped to the socket level and then bound to core.
Each socket will have multiple cores, so if multiple processes are mapped to
a given socket, the binding algorithm will assign each process on that socket
to a unique core in a round-robin manner.
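That round-robin step can be sketched as follows (an illustrative simulation, not Open MPI code; the core ids are assumed known for the socket the processes were mapped to):

```python
def bind_round_robin(n_procs, core_ids):
    """Bind each process mapped to one socket to one of that socket's
    cores, cycling through the cores in order."""
    return {proc: core_ids[proc % len(core_ids)] for proc in range(n_procs)}

# Two processes mapped to a 4-core socket each get their own core:
print(bind_round_robin(2, [0, 1, 2, 3]))  # {0: 0, 1: 1}
# With more processes than cores, the assignment wraps around:
print(bind_round_robin(6, [0, 1, 2, 3]))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1}
```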

   Alternatively,  

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-28 Thread Bennet Fauber
Ralph,

Alas, I will not be at SC16.  I would like to hear and/or see what you
present, so if it gets made available in an alternate format, I'd
appreciate knowing where and how to get it.

I am more and more coming to think that our cluster configuration is
essentially designed to frustrate MPI developers, because we use the
scheduler to create cgroups (once upon a time, cpusets) for subsets of
cores on multisocket machines, and I think that invalidates a lot of
the assumptions being made by people who want to bind to
particular patterns.

It's our foot, and we have been doing a good job of shooting it.  ;-)

-- bennet




On Fri, Oct 28, 2016 at 7:18 PM, r...@open-mpi.org  wrote:
> FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the
> OMPI BoF meeting at SC’16, for those who can attend. Will try to explain the
> rationale as well as the mechanics of the options
>
> On Oct 11, 2016, at 8:09 AM, Dave Love  wrote:
> [quoted message trimmed; it appears in full below]

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-28 Thread r...@open-mpi.org
FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the OMPI 
BoF meeting at SC’16, for those who can attend. Will try to explain the 
rationale as well as the mechanics of the options

> On Oct 11, 2016, at 8:09 AM, Dave Love  wrote:
> [quoted message trimmed; it appears in full below]

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-11 Thread Dave Love
Gilles Gouaillardet  writes:

> Bennet,
>
>
> my guess is mapping/binding to sockets was deemed the best compromise
> from an
>
> "out of the box" performance point of view.
>
>
> iirc, we did fix some bugs that occurred when running under asymmetric
> cpusets/cgroups.
>
> if you still have some issues with the latest Open MPI version (2.0.1)
> and the default policy,
>
> could you please describe them ?

I also don't understand why binding to sockets is the right thing to do.
Binding to cores seems the right default to me, and I set that locally,
with instructions about running OpenMP.  (Isn't that what other
implementations do, which makes them look better?)

I think at least numa should be used, rather than socket.  Knights
Landing, for instance, is single-socket, so one gets no actual binding by
default.


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-09-29 Thread Gilles Gouaillardet

Bennet,


my guess is mapping/binding to sockets was deemed the best compromise 
from an "out of the box" performance point of view.


iirc, we did fix some bugs that occurred when running under asymmetric 
cpusets/cgroups.


if you still have some issues with the latest Open MPI version (2.0.1) 
and the default policy, could you please describe them?


Cheers,


Gilles


On 9/30/2016 10:55 AM, Bennet Fauber wrote:

[quoted message trimmed; it appears in full below]


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-09-29 Thread Bennet Fauber
Pardon my naivete, but why is bind-to-none not the default, and if the
user wants to specify something, they can then get into trouble
knowingly?  We have had all manner of problems with binding when using
cpusets/cgroups.

-- bennet



On Thu, Sep 29, 2016 at 9:52 PM, Gilles Gouaillardet  wrote:
> [quoted message trimmed; it appears in full below]


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-09-29 Thread Gilles Gouaillardet

David,


i guess you would have expected the default mapping/binding scheme to be 
core instead of socket.


iirc, we decided *not* to bind to cores by default because it is "safer"


if you simply run
OMP_NUM_THREADS=8 mpirun -np 2 a.out

then a default mapping/binding scheme by core means the 8 OpenMP threads 
of each task end up time sharing a single core.


this is an honest mistake (8 cores per task were not requested), so 
having a default mapping/binding scheme by socket means the OpenMP 
threads are spread across the socket and will likely not do time sharing.

/* if you run on a single socket, or if you run 4 tasks on a dual-socket 
node, then (some) tasks do share the socket, and depending on how the 
OpenMP runtime is implemented, two threads of two distinct tasks could 
end up bound/running on the same core */
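the arithmetic behind that trade-off is simple; as an illustrative 
sketch (assuming a node with two 8-core sockets, which is not stated in 
the thread):

```python
import math

def worst_threads_per_core(threads_per_rank, cores_in_envelope):
    """Worst-case time sharing when a rank's OpenMP threads are confined
    to its binding envelope (illustrative sketch, not Open MPI code)."""
    return math.ceil(threads_per_rank / cores_in_envelope)

# OMP_NUM_THREADS=8 mpirun -np 2 a.out on a 2-socket, 8-cores-per-socket node:
print(worst_threads_per_core(8, 1))  # 8 -> bound to a core: heavy time sharing
print(worst_threads_per_core(8, 8))  # 1 -> bound to a socket: no time sharing
```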



Cheers,


Gilles


On 9/30/2016 3:04 AM, David Shrader wrote:

[quoted message trimmed; the original message follows below]


[OMPI users] what was the rationale behind rank mapping by socket?

2016-09-29 Thread David Shrader

Hello All,

Would anyone know why the default mapping scheme is socket for jobs with 
more than 2 ranks? Would they be able to please take some time and 
explain the reasoning? Please note I am not railing against the 
decision, but rather trying to gather as much information about it as I 
can so as to be able to better work with my users, who are just now 
starting to ask questions about it. The FAQ pretty much pushes folks to 
the man pages, and the mpirun man page doesn't go into the reasoning.


Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov
