Re: [hwloc-users] mem bind

2018-12-21 Thread Brice Goglin
Hello

That's not how current operating systems work, so hwloc cannot do it.
Usually you can bind a process's virtual memory to a specific part of
physical memory (a NUMA node is basically one big static range), but
binding to an arbitrary physical range isn't allowed by any OS I know of.
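
What hwloc can do is bind a virtual address range to one of those NUMA
nodes. A minimal sketch with the hwloc 2.x API (the node index, buffer
size, and lack of error handling are illustrative only):

#include <hwloc.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* First NUMA node of the machine (one big static physical range). */
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);

    size_t len = 64 * 1024 * 1024;   /* hypothetical 64 MB buffer */
    void *buf = malloc(len);

    /* Ask the OS to back this virtual range with pages from that node. */
    hwloc_set_area_membind(topo, buf, len, node->nodeset,
                           HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);

    /* ... use buf ... */

    free(buf);
    hwloc_topology_destroy(topo);
    return 0;
}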

If you can tweak the hardware, you could try modifying the ACPI tables so
that a specific range of physical memory becomes a new dedicated NUMA node :)

Another crazy idea is to tell the Linux kernel at boot that your ranges
aren't RAM but non-volatile memory. They won't be used by anyone by
default, but you can turn them into "dax" devices that programs can mmap.
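
If you go that route, mapping such a device from a program looks roughly
like this (the /dev/dax0.0 name and the 2 MB size are assumptions, not
something tested here):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0)
        return 1;

    size_t len = 2UL * 1024 * 1024;  /* dax mappings are usually 2 MB aligned */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    /* The process now has direct load/store access to that physical range. */

    munmap(p, len);
    close(fd);
    return 0;
}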

Brice




Le 21/12/2018 à 21:11, Dahai Guo a écrit :
> Hi, 
>
> I was wondering whether there is a good way in hwloc to bind a particular
> range of memory to a process. For example, suppose there is a total of
> 1000 MB on the node; how would I bind the memory range [50, 100] MB to one
> process and [101, 200] MB to another?
>
> If hwloc can do this, an example would be greatly appreciated.
>
> D.
>
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

[hwloc-users] mem bind

2018-12-21 Thread Dahai Guo
Hi,

I was wondering whether there is a good way in hwloc to bind a particular
range of memory to a process. For example, suppose there is a total of
1000 MB on the node; how would I bind the memory range [50, 100] MB to one
process and [101, 200] MB to another?

If hwloc can do this, an example would be greatly appreciated.

D.
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-21 Thread Adam Sylvester
I did some additional profiling of my code.  While the application uses 10
ranks, this particular image breaks into two totally independent pieces,
and we split the world communicator, so really this section of code is
using 5 ranks.

Part of the issue was a ~16 GB allocation buried deep inside several layers
of classes that I was not tracking in my total memory calculations...
obviously nothing to do with OpenMPI.

For the MPI_Allgatherv() stage, there is ~13 GB of data spread roughly
evenly across the 5 ranks that we're gathering via MPI_Allgatherv().  During
that function call, I see an extra 6-7 GB allocated, which must be due to
the underlying buffers used for the transfer.  I tried PMPI_Allgatherv()
followed by MPI_Barrier() but saw the same 6-7 GB spike.  Examining the code
more closely, there is a way I can rearchitect this to send less data across
the ranks (each rank really just needs several rows above and below itself,
not the entire global data).
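
For reference, a minimal sketch of that wrapper approach (intercepting
MPI_Allgatherv() through the PMPI profiling interface; this is not the
actual application code):

#include <mpi.h>

int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm)
{
    MPI_Barrier(comm);                 /* optional: also synchronize before */
    int rc = PMPI_Allgatherv(sendbuf, sendcount, sendtype,
                             recvbuf, recvcounts, displs, recvtype, comm);
    MPI_Barrier(comm);                 /* force pending traffic to drain */
    return rc;
}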

So, I think I'm set for now - thanks for the help.

On Thu, Dec 20, 2018 at 7:49 PM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> you can rewrite MPI_Allgatherv() in your app. It should simply invoke
> PMPI_Allgatherv() (note the leading 'P') with the same arguments,
> followed by MPI_Barrier() on the same communicator (feel free to also
> MPI_Barrier() before PMPI_Allgatherv()).
> That can make your code slower, but it will force the unexpected
> messages related to allgatherv to be received.
> If it helps with respect to memory consumption, that means we have a lead.
>
> Cheers,
>
> Gilles
>
> On Fri, Dec 21, 2018 at 5:00 AM Jeff Hammond 
> wrote:
> >
> > You might try replacing MPI_Allgatherv with the equivalent Send+Recv
> followed by Broadcast.  I don't think MPI_Allgatherv is particularly
> optimized (since it is hard to do and not a very popular function), and the
> replacement might improve your memory utilization.
> >
> > Jeff
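
A minimal sketch of that alternative (gather the pieces on rank 0 with
point-to-point messages, then broadcast the assembled buffer; buf, counts,
and displs are hypothetical and assume packed, increasing displacements):

#include <mpi.h>

void gather_then_bcast(float *buf, const int *counts, const int *displs,
                       int rank, int nranks, MPI_Comm comm)
{
    if (rank == 0) {
        /* Receive each rank's block into its slot of the shared buffer. */
        for (int r = 1; r < nranks; ++r)
            MPI_Recv(buf + displs[r], counts[r], MPI_FLOAT, r, 0, comm,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf + displs[rank], counts[rank], MPI_FLOAT, 0, 0, comm);
    }

    /* Everyone ends up with the fully assembled buffer, as with Allgatherv. */
    int total = displs[nranks - 1] + counts[nranks - 1];
    MPI_Bcast(buf, total, MPI_FLOAT, 0, comm);
}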
> >
> > On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester  wrote:
> >>
> >> Gilles,
> >>
> >> It is btl/tcp (we'll be upgrading to newer EC2 types next year to take
> advantage of libfabric).  I need to write a script to log and timestamp the
> memory usage of the process as reported by /proc/<pid>/stat and sync that
> up with the application's log of what it's doing to say this definitively,
> but based on what I've watched on 'top' so far, I think where these big
> allocations are happening are two areas where I'm doing MPI_Allgatherv() -
> every rank has roughly 1/numRanks of the data (but not divided exactly
> evenly, so we need to use MPI_Allgatherv)... the ranks are reusing that
> pre-allocated buffer to store their local results and then pass that same
> pre-allocated buffer into MPI_Allgatherv() to bring results in from all
> ranks.  So, there is a lot of communication across all ranks at these
> points.  So, does your comment about using the coll/sync module apply in
> this case?  I'm not familiar with this module - is this something I specify
> at OpenMPI compile time or a runtime option that I enable?
> >>
> >> Thanks for the detailed help.
> >> -Adam
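
The buffer reuse described above presumably maps to the in-place form of
MPI_Allgatherv(), roughly like this sketch (buf, counts, and displs are
hypothetical names, not from the application):

#include <mpi.h>

/* Each rank's own contribution already sits at buf + displs[rank] before
 * the call; the call fills in every other rank's block. */
static void allgather_in_place(float *buf, const int *counts,
                               const int *displs, MPI_Comm comm)
{
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   buf, counts, displs, MPI_FLOAT, comm);
}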
> >>
> >> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >>>
> >>> Adam,
> >>>
> >>> Are you using btl/tcp (e.g. plain TCP/IP) for internode communications?
> >>> Or are you using libfabric on top of the latest EC2 drivers?
> >>>
> >>> There is no flow control in btl/tcp, which means for example if all
> >>> your nodes send messages to rank 0, that can create a lot of
> >>> unexpected messages on that rank.
> >>> In the case of btl/tcp, this means a lot of malloc() on rank 0, until
> >>> these messages are received by the app.
> >>> If rank 0 is overflowed, then that will likely end up in the node
> >>> swapping to death (or killing your app if you have little or no swap).
> >>>
> >>> If you are using collective operations, make sure the coll/sync module
> >>> is selected.
> >>> This module inserts MPI_Barrier() every n collectives on a given
> >>> communicator. This forces your processes to synchronize and can force
> >>> messages to be received. (Think of the previous example if you run
> >>> MPI_Scatter(root=0) in a loop.)
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester 
> wrote:
> >>> >
> >>> > This case is actually quite small - 10 physical machines with 18
> physical cores each, 1 rank per machine.  These are AWS R4 instances (Intel
> Xeon E5 Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).
> >>> >
> >>> > I calculate the memory needs of my application upfront (in this case
> ~225 GB per machine), allocate one buffer upfront, and reuse this buffer
> for valid and scratch throughout processing.  This is running on RHEL 7 -
> I'm measuring memory usage via top where I see it go up to 248 GB in an
> MPI-intensive portion of processing.
> >>> >
> >>> > I thought I was being quite careful with my memory allocations and
> there weren't 

[MTT users] getting started

2018-12-21 Thread Robert Frank

Hi,

I'm totally new to the testing tools and am looking for some good information 
on how to get started with them.
The information I've found so far is far from helpful.

I've checked out https://www.open-mpi.org/projects/mtt/ and the main wiki site 
https://github.com/open-mpi/mtt/wiki.
This information is all pretty old and in places wrong (such as the reference 
to the svn repository), but I managed to pull a copy from GitHub.

A quick look at the sample configs shows that the documentation available is 
not anywhere close to being useful.

I'd like to set this up so that I can verify that the many MPI installations 
we have on our cluster work correctly.
We use both GCC and the Intel compilers, have several MPI variants installed 
as EasyBuild modules, and use SLURM to schedule jobs.

Is there anything out there that could guide me through the steps to set up 
such a test environment?


Many thanks in advance
Robert

Dep. Mathematik und Informatik  tel   +41 61 207 14 66
Mon, Wed, Thu: whole day, Tue+Fri: afternoon
Universität Basel
Robert Frank
Room 04.009
Spiegelgasse 1 
CH-4051 Basel
Switzerland

___
mtt-users mailing list
mtt-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/mtt-users