Hello

Mems_allowed_list shows what your current cgroup/cpuset allows. That is
different from what mbind/numactl/hwloc/... change.
The former is a root-only restriction that processes placed in that
cgroup cannot override.
The latter is a user-changeable binding that must lie inside the former.
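
If you want to see the difference, here is a minimal C sketch (assuming
libnuma's numaif.h is installed; error handling omitted) that prints the
user policy via get_mempolicy() next to the kernel's Mems_allowed_list.
Run it with and without numactl -m/-i and only the former changes:

#include <stdio.h>
#include <string.h>
#include <numaif.h>   /* get_mempolicy(); link with -lnuma */

int main(void)
{
    int mode = -1;
    unsigned long nodemask = 0;

    /* Task-wide policy set by numactl/mbind/set_mempolicy() (user-changeable). */
    if (get_mempolicy(&mode, &nodemask, 8 * sizeof(nodemask), NULL, 0) == 0)
        printf("policy mode: %d  nodemask: 0x%lx\n", mode, nodemask);

    /* Cpuset/cgroup restriction (root-only), as reported by the kernel. */
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "Mems_allowed_list", 17) == 0)
            fputs(line, stdout);
    if (f)
        fclose(f);
    return 0;
}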

Brice




On 19/07/2017 17:29, Iliev, Hristo wrote:
> Giles,
>
> Mems_allowed_list has never worked for me:
>
> $ uname -r
> 3.10.0-514.26.1.el7.x86_64
>
> $ numactl -H | grep available
> available: 2 nodes (0-1)
>
> $ grep Mems_allowed_list /proc/self/status
> Mems_allowed_list:      0-1
>
> $ numactl -m 0 grep Mems_allowed_list /proc/self/status
> Mems_allowed_list:      0-1
>
> It seems that whatever structure Mems_allowed_list exposes is outdated. One 
> should use "numactl -s" instead:
>
> $ numactl -s
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> $ numactl -m 0 numactl -s
> policy: bind
> preferred node: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0
>
> $ numactl -i all numactl -s
> policy: interleave
> preferred node: 0 (interleave next)
> interleavemask: 0 1
> interleavenode: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> I wouldn't ask Open MPI not to bind the processes, as the policy set by 
> numactl takes precedence over what orterun/the shepherd sets, at least 
> with non-MPI programs:
>
> $ orterun -n 2 --bind-to core --map-by socket numactl -i all numactl -s
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 0 24
> cpubind: 0
> nodebind: 0
> membind: 0 1
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 12 36
> cpubind: 1
> nodebind: 1
> membind: 0 1
>
> Cheers,
> Hristo
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles 
> Gouaillardet
> Sent: Monday, July 17, 2017 5:43 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] NUMA interaction with Open MPI
>
> Adam,
>
> Keep in mind that, by default, recent Open MPI versions bind MPI tasks
> - to cores if -np 2
> - to the NUMA domain otherwise (which is a socket in most cases, unless
> you are running on a Xeon Phi)
>
> So unless you specifically asked mpirun for a binding consistent
> with your needs, you might simply try asking for no binding at all:
> mpirun --bind-to none ...
>
> I am not sure whether you can directly ask Open MPI to do the memory
> binding you expect from the command line.
> Anyway, as far as I am concerned,
> mpirun --bind-to none numactl --interleave=all ...
> should do what you expect.
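>
> If you need the interleaving from inside the application instead, libnuma
> can do it per allocation; a rough, untested sketch (assuming libnuma-devel
> is installed; the helper name is made up):
>
> #include <stdlib.h>
> #include <numa.h>   /* numa_available(), numa_alloc_interleaved(); -lnuma */
>
> /* allocate a large buffer interleaved across all allowed NUMA nodes */
> void *alloc_interleaved(size_t bytes)
> {
>     if (numa_available() < 0)
>         return malloc(bytes);              /* no NUMA support: plain malloc */
>     return numa_alloc_interleaved(bytes);  /* free with numa_free(p, bytes) */
> }
>
> (the whole-process equivalent of numactl --interleave=all would be calling
> numa_set_interleave_mask(numa_all_nodes_ptr) early in main)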
>
> If you want to be sure, you can simply run
> mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list
> /proc/self/status
> and that should give you a hint.
>
> Cheers,
>
> Gilles
>
>
> On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
>> I'll start with my question upfront: Is there a way to do the equivalent
>> of telling mpirun to do 'numactl --interleave=all' on the processes that
>> it runs?  Or, if I want to control the memory placement of applications
>> run through MPI, will I need to use libnuma for this?  I tried doing
>> "mpirun <Open MPI options> numactl --interleave=all <app name and
>> options>".  I don't know how to explicitly verify whether this ran the
>> numactl command on each host, but based on the performance I'm seeing, it
>> doesn't seem like it did (or something else is causing my poor
>> performance).
>>
>> More details: For the particular image I'm benchmarking with, I have a
>> multi-threaded application which requires 60 GB of RAM when run on a
>> single machine.  It allocates one large ping/pong buffer upfront and uses
>> this to avoid copies when updating the image at each step.  I'm running
>> in AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM,
>> 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps).  Running on a
>> single X1, my application runs ~3x faster than on the R3; using numactl
>> --interleave=all has a significant positive effect on its performance, I
>> assume because the various threads are accessing memory spread out across
>> the nodes rather than most of them having slow access to it.  So far so
>> good.
>>
>> My application also supports distributing across machines via MPI.  When
>> doing this, the memory requirement scales linearly with the number of
>> machines; there are three pinch points that involve large (GBs of data)
>> all-to-all communication.  For the slowest of these three, I've pipelined
>> this step and use MPI_Ialltoallv() to hide as much of the latency as I can.
>> When run on R3 instances, overall runtime scales very well as machines are
>> added.  Still so far so good.
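>>
>> Roughly, the overlap looks like the simplified sketch below (made-up
>> names for the per-chunk buffers and the compute step):
>>
>> #include <mpi.h>
>>
>> /* Exchange chunk i while chunk i-1 is being processed; each *counts /
>>  * *displs entry is a per-chunk array of length comm_size. */
>> static void pipelined_alltoall(int nchunks, float **sendbuf, float **recvbuf,
>>                                int **sendcounts, int **sdispls,
>>                                int **recvcounts, int **rdispls,
>>                                void (*process_chunk)(int))
>> {
>>     MPI_Request req;
>>     for (int i = 0; i < nchunks; ++i) {
>>         MPI_Ialltoallv(sendbuf[i], sendcounts[i], sdispls[i], MPI_FLOAT,
>>                        recvbuf[i], recvcounts[i], rdispls[i], MPI_FLOAT,
>>                        MPI_COMM_WORLD, &req);
>>         if (i > 0)
>>             process_chunk(i - 1);   /* compute overlaps the communication */
>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>     }
>>     if (nchunks > 0)
>>         process_chunk(nchunks - 1); /* last chunk: nothing left to overlap */
>> }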
>>
>> My problems start with the X1 instances.  I do get scaling as I add more
>> machines, but it is significantly worse than with the R3s.  This isn't
>> just a matter of there being more CPUs and the MPI communication time
>> dominating: the actual time spent in the MPI all-to-all communication is
>> significantly longer than on the R3s for the same number of machines,
>> despite the network bandwidth being twice as high (in a post from a few
>> days ago some folks helped me with MPI settings to improve the network
>> communication speed, and from toy benchmark MPI tests I know I'm getting
>> faster communication on the X1s than on the R3s).  So this feels likely
>> to be an issue with NUMA, though I'd be interested in any other thoughts.
>>
>> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
>> didn't seem to have what I was looking for.  I want MPI to let my
>> application use all CPUs on the system (I'm the only one running on it)... I
>> just want to control the memory placement.
>>
>> Thanks for the help.
>> -Adam
>>
>>
>>
