[OMPI users] Process mapping and affinity on the devel trunk

Ralph Castain Sun, 11 Dec 2011 00:14:38 -0500

Hello all

If you are using the developer's trunk or nightly tarball, or are interested in 
new mapping and binding options that will be included in the next feature 
series (1.7), then please read on. If not, then please ignore.


A couple of people have recently reported that "the trunk isn't binding 
processes any more". OMPI's mapping, ranking, and binding options underwent a 
major change on the developer's trunk a few weeks ago. This was done to provide 
a greater range of options for process placement and binding. Although this was 
mentioned on the devel mailing list a while ago, I thought a general message 
might be in order, especially for those users out there who are working with 
the trunk.

Most importantly, under the new system, opal_paffinity_alone (and its 
synonym, mpi_paffinity_alone) has been disabled - it no longer does anything. I 
have added a warning so that any setting of that parameter will alert you to 
this situation. This is more than likely the reason why you are not seeing 
processes bound.

That option has been replaced by the --bind-to <foo> option, where <foo> can be 
none, hardware thread (hwthread), core, L1 cache (l1cache), L2 cache (l2cache), 
L3 cache (l3cache), socket, or NUMA region (numa). This can also be set as the 
MCA parameter "hwloc_base_binding_policy". There are two allowed qualifiers to 
the binding option:

* if-supported - binding will be done if the system supports it. If the system 
does not support it, the application will execute unbound without issuing a 
warning. Without this qualifier, an unsupported binding request causes an error 
message to be emitted and the execution aborted.

* overload-allowed - if the binding results in more processes than cpus being 
bound to a resource (e.g., if 4 processes are bound to a socket that only has 2 
cpus), then execution will be terminated with an error unless this qualifier is 
provided.
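
As a rough illustration of how these two qualifiers behave (a sketch of the 
rules described above, not the actual OMPI code; the function names are made 
up for this example):

```python
def resolve_binding(system_supports_binding, if_supported=False):
    """Model of the if-supported qualifier (hypothetical helper)."""
    if system_supports_binding:
        return "bound"
    if if_supported:
        # Qualifier given: run unbound, silently.
        return "unbound"
    # No qualifier: error out and abort.
    raise RuntimeError("binding requested but not supported on this system")


def check_overload(procs_bound, cpus_in_resource, overload_allowed=False):
    """Model of the overload-allowed qualifier (hypothetical helper).

    Returns True when the resource is overloaded but the qualifier permits it.
    """
    overloaded = procs_bound > cpus_in_resource
    if overloaded and not overload_allowed:
        raise RuntimeError(
            f"{procs_bound} processes bound to a resource with "
            f"{cpus_in_resource} cpus"
        )
    return overloaded


# The example from the text: 4 processes bound to a socket with 2 cpus.
print(check_overload(4, 2, overload_allowed=True))
```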


Mapping was also expanded to support mapping by all the same locations via the 
--map-by <foo> option, plus two additional locations: slot (default) and node. 
The option is also available as MCA parameter "rmaps_base_mapping_policy". The 
mapping option has three qualifiers:

* span - treat all allocated nodes as if they were a single node - i.e., map 
across all specified resources before looping around and placing the next layer 
of processes on them. The default is to loop across all resources on each node 
until that node is completely filled before moving to the next node, so the 
"span" qualifier acts to balance the load across the allocation.

* oversubscribe - allow more processes than allocated slots to be mapped onto a 
node. This is the default for user-specified allocations (i.e., by hostfile or 
-host).

* nooversubscribe - error out if more processes than allocated slots are mapped 
onto a node. This is the default for resource managed allocations (e.g., 
specified by SLURM or MOAB).
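
To make the effect of "span" concrete, here is a small sketch of mapping by 
socket with and without it. This is only my simplified model of the behavior 
described above, not the mapper's real code: the node tuples, the function 
name, and the refusal to oversubscribe are all assumptions of the example.

```python
def map_by_socket(nodes, nprocs, span=False):
    """Return one (node, socket) placement per process.

    nodes: list of (node_name, slots, [socket_names]) tuples (hypothetical).
    """
    if span:
        # Treat the whole allocation as a single node: one flat socket list.
        groups = [[(name, sock, slots)
                   for name, slots, sockets in nodes
                   for sock in sockets]]
    else:
        # Default: fill each node completely before moving to the next.
        groups = [[(name, sock, slots) for sock in sockets]
                  for name, slots, sockets in nodes]

    used = {name: 0 for name, _, _ in nodes}
    placements = []
    for group in groups:
        progress = True
        while len(placements) < nprocs and progress:
            progress = False
            # One "layer": at most one new process per socket in the group.
            for name, sock, slots in group:
                if len(placements) == nprocs:
                    break
                if used[name] < slots:
                    placements.append((name, sock))
                    used[name] += 1
                    progress = True
    if len(placements) < nprocs:
        raise RuntimeError("not enough slots (oversubscription not modeled)")
    return placements


allocation = [("n0", 4, ["s0", "s1"]), ("n1", 4, ["s0", "s1"])]
print(map_by_socket(allocation, 4))             # all four land on n0
print(map_by_socket(allocation, 4, span=True))  # one per socket on n0 and n1
```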

Another mapper was also added to the system. The "ppr" mapper takes a string 
argument detailing the number of processes to be placed on each resource, with 
the supported resources again including all those specified above. For example, 
a string of "4:node,2:socket,1:core" would tell the mapper to place one process 
on every core in the allocation, with a maximum of 2 on each socket and 4 on 
each node.
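
For those who find it easier to read as code, the example string can be 
unpacked roughly like this. The helper names are hypothetical, and the way the 
limits are combined is just my reading of the example above (each count caps 
the number of processes at its own level), not the mapper's implementation.

```python
def parse_ppr(spec):
    """Parse a string such as "4:node,2:socket,1:core" into {resource: count}."""
    limits = {}
    for item in spec.split(","):
        count, resource = item.strip().split(":")
        limits[resource.strip()] = int(count)
    return limits


def procs_on_node(limits, cores_per_socket):
    """Processes placed on one node, given the core count of each socket.

    Assumed reading: the per-core count is capped by the per-socket
    limit, and the node total is capped by the per-node limit.
    """
    unlimited = float("inf")
    per_core = limits.get("core", unlimited)
    per_socket = limits.get("socket", unlimited)
    per_node = limits.get("node", unlimited)

    total = 0
    for ncores in cores_per_socket:
        total += min(per_socket, ncores * per_core)
    return min(per_node, total)


limits = parse_ppr("4:node,2:socket,1:core")
print(limits)
print(procs_on_node(limits, [4, 4]))  # a node with two 4-core sockets
```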


Assigning process ranks has a corresponding --rank-by <foo> option, with all 
the same values for <foo> as found for mapping (including the use of "slot" as 
the default). This option is available through the MCA parameter 
"rmaps_base_ranking_policy". The ranking option has two qualifiers:

* span - similar to the mapping qualifier, this causes the ranks to be assigned 
across all specified resources as if they were a single node.

* fill - assign ranks sequentially to all processes on the given resource 
before moving to the next one, filling all such resources on each node before 
moving to the next.


Please note that several convenience options were retained for backward 
compatibility:

* --pernode, --npernode N, --npersocket N: the npersocket option now binds the 
processes to their mapped socket unless another binding option was specified

* --bind-to-core, --bind-to-socket

* --bynode, --byslot


All three options (mapping, ranking, binding) can be used in any combination. 
Thus, you can assign a mapping pattern, pick any option for assigning ranks, 
and pick any option for binding. For example, you could map-by socket, rank-by 
core, and bind-to numa. As a result, there are a very large number of ways to 
arrange your application.
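
As a sketch of how the three options fit together on a command line, the 
snippet below assembles the example just mentioned. The --map-by, --rank-by, 
and --bind-to options and the MCA parameter names are the ones given in this 
message; "-np", the "--mca <name> <value>" form, the helper names, and the 
application name ./my_app are assumptions of the example. Qualifiers are left 
out because their command-line syntax is not covered here.

```python
import shlex


def mpirun_command(app, nprocs, map_by="slot", rank_by="slot", bind_to="none"):
    """Build an mpirun argument list using the command-line options."""
    return ["mpirun", "-np", str(nprocs),
            "--map-by", map_by,
            "--rank-by", rank_by,
            "--bind-to", bind_to,
            app]


def mpirun_command_mca(app, nprocs, map_by="slot", rank_by="slot", bind_to="none"):
    """Same request expressed through the equivalent MCA parameters."""
    return ["mpirun", "-np", str(nprocs),
            "--mca", "rmaps_base_mapping_policy", map_by,
            "--mca", "rmaps_base_ranking_policy", rank_by,
            "--mca", "hwloc_base_binding_policy", bind_to,
            app]


# ./my_app is a placeholder application name.
cmd = mpirun_command("./my_app", 8, map_by="socket", rank_by="core", bind_to="numa")
print(shlex.join(cmd))
```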

I realize all this flexibility can be confusing and a little overwhelming. I am 
working to provide more documentation on the OMPI wiki site, but it isn't done 
yet. I will let people know when it is completed.

HTH
Ralph
