Re: [OMPI users] what was the rationale behind rank mapping by socket?

r...@open-mpi.org Fri, 28 Oct 2016 17:43:22 -0700

Yes, I’ve been hearing a growing number of complaints about cgroups for that
reason. Our mapping/ranking/binding options will work within the cgroup
envelope, but the result generally isn’t what the user wanted or expected.


We always post the OMPI BoF slides on our web site, and we’ll do the same this
year. I may try to record a webcast on it and post that as well, since I know
it can be confusing given all the flexibility we expose.

In case you haven’t read it yet, here is the relevant section from “man mpirun”:

 Mapping, Ranking, and Binding: Oh My!
       Open MPI employs a three-phase procedure for assigning process
       locations and ranks:

       mapping   Assigns a default location to each process

       ranking   Assigns an MPI_COMM_WORLD rank value to each process

       binding   Constrains each process to run on specific processors

       The mapping step is used to assign a default location to each process
       based on the mapper being employed. Mapping by slot, node, and
       sequentially results in the assignment of the processes to the node
       level. In contrast, mapping by object allows the mapper to assign the
       process to an actual object on each node.

       Note: the location assigned to the process is independent of where it
       will be bound - the assignment is used solely as input to the binding
       algorithm.

       The mapping of processes to nodes can be defined not just with general
       policies but also, if necessary, using arbitrary mappings that cannot
       be described by a simple policy.  One can use the "sequential mapper,"
       which reads the hostfile line by line, assigning processes to nodes in
       whatever order the hostfile specifies.  Use the -mca rmaps seq option.
       For example, using the same hostfile as before:

       mpirun -hostfile myhostfile -mca rmaps seq ./a.out

       will launch three processes, one on each of nodes aa, bb, and cc,
       respectively.  The slot counts don't matter; one process is launched
       per line on whatever node is listed on the line.
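
To make the "one process per line" behaviour concrete, here is a rough Python sketch of what the sequential mapper is described as doing. It is not Open MPI code: the function name map_sequentially is made up for illustration, and the hostfile contents (including the slot counts) are an assumption, since the hostfile the man page refers to is not reproduced in this excerpt.

```python
def map_sequentially(hostfile_text):
    """Return one (process_index, node) pair per non-empty hostfile line.

    Illustrative only: mimics the described behaviour of the sequential
    mapper, where slot counts are ignored and each line yields one process.
    """
    placements = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        node = line.split()[0]  # keep the node name, ignore "slots=N"
        placements.append((len(placements), node))
    return placements


# Hypothetical hostfile; the slot counts are assumptions for this sketch.
HOSTFILE = """\
aa slots=2
bb slots=2
cc slots=2
"""

for index, node in map_sequentially(HOSTFILE):
    print(f"process {index} -> node {node}")
```

Listing a node on several lines would, under the same reading, place one process on it per line.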

       Another way to specify arbitrary mappings is with a rankfile, which
       gives you detailed control over process binding as well.  Rankfiles
       are discussed below.

       The second phase focuses on the ranking of the process within the
       job's MPI_COMM_WORLD.  Open MPI separates this from the mapping
       procedure to allow more flexibility in the relative placement of MPI
       processes. This is best illustrated by considering the following
       cases where we used the --map-by ppr:2:socket option:

                                 node aa       node bb

           rank-by core          0 1 ! 2 3     4 5 ! 6 7

           rank-by socket        0 2 ! 1 3     4 6 ! 5 7

           rank-by socket:span   0 4 ! 1 5     2 6 ! 3 7

       Ranking by core and by slot provide the identical result - a simple
       progression of MPI_COMM_WORLD ranks across each node. Ranking by
       socket does a round-robin ranking within each node until all
       processes have been assigned an MCW rank, and then progresses to the
       next node. Adding the span modifier to the ranking directive causes
       the ranking algorithm to treat the entire allocation as a single
       entity - thus, the MCW ranks are assigned across all sockets before
       circling back around to the beginning.
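
The three ranking orders can be reproduced with a few lines of Python. This is only a sketch of the behaviour as described above, not the actual Open MPI ranking code; it assumes two nodes (aa and bb) with two sockets each and two processes per socket, matching the ppr:2:socket example, and the names rank_layout and format_row are invented for illustration.

```python
NODES = ["aa", "bb"]
SOCKETS_PER_NODE = 2    # assumption: two sockets per node, as in the table
PROCS_PER_SOCKET = 2    # from --map-by ppr:2:socket


def rank_layout(policy):
    """Return {(node, socket): [ranks]} for 'core', 'socket' or 'socket:span'."""
    sockets = [(n, s) for n in NODES for s in range(SOCKETS_PER_NODE)]
    if policy == "core":
        # Simple progression: finish one socket before moving to the next.
        groups = [[sock] for sock in sockets]
    elif policy == "socket":
        # Round-robin across the sockets of one node, then the next node.
        groups = [[(n, s) for s in range(SOCKETS_PER_NODE)] for n in NODES]
    elif policy == "socket:span":
        # Round-robin across every socket in the whole allocation.
        groups = [sockets]
    else:
        raise ValueError(f"unknown policy: {policy}")

    layout = {sock: [] for sock in sockets}
    rank = 0
    for group in groups:
        for _ in range(PROCS_PER_SOCKET):
            for sock in group:
                layout[sock].append(rank)
                rank += 1
    return layout


def format_row(policy):
    """Render one table row, with '!' separating the sockets of a node."""
    layout = rank_layout(policy)
    per_node = []
    for n in NODES:
        per_socket = [" ".join(map(str, layout[(n, s)]))
                      for s in range(SOCKETS_PER_NODE)]
        per_node.append(" ! ".join(per_socket))
    return "     ".join(per_node)


for policy in ("core", "socket", "socket:span"):
    print(f"rank-by {policy:<12} {format_row(policy)}")
```

The only difference between the three policies in this sketch is the group of sockets that the round-robin pass cycles over: a single socket, the sockets of one node, or all sockets in the allocation.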

       The binding phase actually binds each process to a given set of
       processors. This can improve performance if the operating system is
       placing processes suboptimally.  For example, it might oversubscribe
       some multi-core processor sockets, leaving other sockets idle; this
       can lead processes to contend unnecessarily for common resources.
       Or, it might spread processes out too widely; this can be suboptimal
       if application performance is sensitive to interprocess communication
       costs.  Binding can also keep the operating system from migrating
       processes excessively, regardless of how optimally those processes
       were placed to begin with.

       The processors to be used for binding can be identified in terms of
       topological groupings - e.g., binding to an l3cache will bind each
       process to all processors within the scope of a single L3 cache
       within its assigned location. Thus, if a process is assigned by the
       mapper to a certain socket, then a --bind-to l3cache directive will
       cause the process to be bound to the processors that share a single
       L3 cache within that socket.

       To help balance loads, the binding directive uses a round-robin
       method when binding to levels lower than used in the mapper. For
       example, consider the case where a job is mapped to the socket level,
       and then bound to core. Each socket will have multiple cores, so if
       multiple processes are mapped to a given socket, the binding
       algorithm will assign each process located on a socket to a unique
       core in a round-robin manner.

       Alternatively, processes mapped by l2cache and then bound to socket
       will simply be bound to all the processors in the socket where they
       are located. In this manner, users can exert detailed control over
       relative MCW rank location and binding.
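
The two binding cases just described (binding below the mapping level versus binding at or above it) can be sketched as follows. This is an illustration under stated assumptions, not the real binding algorithm: it assumes sockets of four consecutively numbered cores, the helper names are hypothetical, and wrapping around when a socket has more processes than cores is a simplification of my own.

```python
CORES_PER_SOCKET = 4  # assumption: four cores per socket, numbered consecutively


def socket_cores(socket):
    """Core indices belonging to the given socket."""
    start = socket * CORES_PER_SOCKET
    return list(range(start, start + CORES_PER_SOCKET))


def bind_to_core(mapped_sockets):
    """Bind below the mapping level: each process gets one core of its
    socket, handed out round-robin. mapped_sockets[i] is the socket the
    mapper chose for process i."""
    next_slot = {}
    bindings = []
    for socket in mapped_sockets:
        slot = next_slot.get(socket, 0)
        cores = socket_cores(socket)
        bindings.append([cores[slot % len(cores)]])  # wrap-around is a simplification
        next_slot[socket] = slot + 1
    return bindings


def bind_to_socket(mapped_sockets):
    """Bind at or above the mapping level: each process simply gets every
    core of the socket that contains its mapped location."""
    return [socket_cores(socket) for socket in mapped_sockets]


mapped = [0, 1, 0, 1]  # four processes mapped alternately to two sockets
print(bind_to_core(mapped))    # [[0], [4], [1], [5]]
print(bind_to_socket(mapped))  # [[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], [4, 5, 6, 7]]
```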

       Finally, --report-bindings can be used to report bindings.

       As an example, consider a node with two processor sockets, each
       comprising four cores.  We run mpirun with -np 4 --report-bindings
       and the following additional options:

        % mpirun ... --map-by core --bind-to core
        [...] ... binding child [...,0] to cpus 0001
        [...] ... binding child [...,1] to cpus 0002
        [...] ... binding child [...,2] to cpus 0004
        [...] ... binding child [...,3] to cpus 0008

        % mpirun ... --map-by socket --bind-to socket
        [...] ... binding child [...,0] to socket 0 cpus 000f
        [...] ... binding child [...,1] to socket 1 cpus 00f0
        [...] ... binding child [...,2] to socket 0 cpus 000f
        [...] ... binding child [...,3] to socket 1 cpus 00f0

        % mpirun ... --map-by core:PE=2 --bind-to core
        [...] ... binding child [...,0] to cpus 0003
        [...] ... binding child [...,1] to cpus 000c
        [...] ... binding child [...,2] to cpus 0030
        [...] ... binding child [...,3] to cpus 00c0

        % mpirun ... --bind-to none


       Here, --report-bindings shows the binding of each process as a mask.
       In the first case, the processes bind to successive cores as
       indicated by the masks 0001, 0002, 0004, and 0008.  In the second
       case, processes bind to all cores on successive sockets as indicated
       by the masks 000f and 00f0.  The processes cycle through the
       processor sockets in a round-robin fashion as many times as are
       needed.  In the third case, the masks show us that 2 cores have been
       bound per process.  In the fourth case, binding is turned off and no
       bindings are reported.
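
If the masks are hard to read at a glance, a small helper can expand them. The sketch below assumes the masks are hexadecimal with bit i standing for core i, which is consistent with the explanation above (000f covering the four cores of socket 0, 00f0 those of socket 1); the function name cores_in_mask is made up for illustration.

```python
def cores_in_mask(mask_hex):
    """Expand a hexadecimal binding mask into a sorted list of core indices.

    Assumes bit i of the mask corresponds to core i.
    """
    value = int(mask_hex, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Masks taken from the three --report-bindings examples above.
for mask in ("0001", "0008", "000f", "00f0", "0003", "00c0"):
    print(mask, "->", cores_in_mask(mask))
```

For instance, 0003 expands to cores 0 and 1, and 00c0 to cores 6 and 7, which is the "2 cores per process" pattern of the PE=2 case.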

       Open MPI's support for process binding depends on the underlying
       operating system.  Therefore, certain process binding options may not
       be available on every system.

       Process binding can also be set with MCA parameters.  Their usage is
       less convenient than that of mpirun options.  On the other hand, MCA
       parameters can be set not only on the mpirun command line, but
       alternatively in a system or user mca-params.conf file or as
       environment variables, as described in the MCA section below.  Some
       examples include:

           mpirun option          MCA parameter key         value

         --map-by core          rmaps_base_mapping_policy   core
         --map-by socket        rmaps_base_mapping_policy   socket
         --rank-by core         rmaps_base_ranking_policy   core
         --bind-to core         hwloc_base_binding_policy   core
         --bind-to socket       hwloc_base_binding_policy   socket
         --bind-to none         hwloc_base_binding_policy   none
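
As a convenience, the table can be turned into the other two forms mentioned above. The sketch below relies on two Open MPI conventions that are not spelled out in this excerpt: MCA parameters set through the environment take an OMPI_MCA_ prefix, and mca-params.conf holds "key = value" lines. The helper names are hypothetical, and the table only covers the rows quoted here.

```python
# The option/parameter pairs listed in the table above.
MCA_EQUIVALENTS = {
    ("--map-by", "core"): ("rmaps_base_mapping_policy", "core"),
    ("--map-by", "socket"): ("rmaps_base_mapping_policy", "socket"),
    ("--rank-by", "core"): ("rmaps_base_ranking_policy", "core"),
    ("--bind-to", "core"): ("hwloc_base_binding_policy", "core"),
    ("--bind-to", "socket"): ("hwloc_base_binding_policy", "socket"),
    ("--bind-to", "none"): ("hwloc_base_binding_policy", "none"),
}


def as_environment(option, value):
    """Environment-variable form, using the OMPI_MCA_ prefix convention."""
    key, val = MCA_EQUIVALENTS[(option, value)]
    return f"OMPI_MCA_{key}={val}"


def as_conf_line(option, value):
    """Line suitable for an mca-params.conf file ("key = value")."""
    key, val = MCA_EQUIVALENTS[(option, value)]
    return f"{key} = {val}"


print(as_environment("--bind-to", "core"))
print(as_conf_line("--map-by", "socket"))
```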


> On Oct 28, 2016, at 4:50 PM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> Ralph,
> 
> Alas, I will not be at SC16.  I would like to hear and/or see what you
> present, so if it gets made available in an alternate format, I'd
> appreciate knowing where and how to get it.
> 
> I am more and more coming to think that our cluster configuration is
> essentially designed to frustrate MPI developers because we use the
> scheduler to create cgroups (once upon a time, cpusets) for subsets of
> cores on multisocket machines, and I think that invalidates a lot of
> the assumptions that are getting made by people who want to bind to
> particular patterns.
> 
> It's our foot, and we have been doing a good job of shooting it.  ;-)
> 
> -- bennet
> 
> 
> 
> 
> On Fri, Oct 28, 2016 at 7:18 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the
>> OMPI BoF meeting at SC’16, for those who can attend. Will try to explain the
>> rationale as well as the mechanics of the options
>> 
>> On Oct 11, 2016, at 8:09 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>> 
>> Gilles Gouaillardet <gil...@rist.or.jp> writes:
>> 
>> Bennet,
>> 
>> my guess is mapping/binding to sockets was deemed the best compromise
>> from an "out of the box" performance point of view.
>> 
>> iirc, we did fix some bugs that occurred when running under asymmetric
>> cpusets/cgroups.
>> 
>> if you still have some issues with the latest Open MPI version (2.0.1)
>> and the default policy, could you please describe them ?
>> 
>> 
>> I also don't understand why binding to sockets is the right thing to do.
>> Binding to cores seems the right default to me, and I set that locally,
>> with instructions about running OpenMP.  (Isn't that what other
>> implementations do, which makes them look better?)
>> 
>> I think at least numa should be used, rather than socket.  Knights
>> Landing, for instance, is single-socket, so it gets no actual binding by
>> default.
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
