Adam,

You can redefine MPI_Allgatherv() in your app: it should simply invoke
PMPI_Allgatherv() (note the leading 'P') with the same arguments,
followed by MPI_Barrier() on the same communicator (feel free to also
call MPI_Barrier() before PMPI_Allgatherv()).
This can make your code slower, but it will force the unexpected
messages related to the allgatherv to be received.
If it helps with respect to memory consumption, that means we have a lead.
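
For example, something along these lines (an untested sketch; compile it
into your application so it overrides the library's MPI_Allgatherv()):

#include <mpi.h>

/* Intercept MPI_Allgatherv(): do the real work via the PMPI entry point,
 * then synchronize so the unexpected messages generated by the collective
 * are drained before returning to the application. */
int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm)
{
    MPI_Barrier(comm);   /* optional: also synchronize before */

    int rc = PMPI_Allgatherv(sendbuf, sendcount, sendtype,
                             recvbuf, recvcounts, displs, recvtype, comm);

    MPI_Barrier(comm);
    return rc;
}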

Cheers,

Gilles

On Fri, Dec 21, 2018 at 5:00 AM Jeff Hammond <jeff.scie...@gmail.com> wrote:
>
> You might try replacing MPI_Allgatherv with the equivalent Send+Recv followed
> by Broadcast.  I don't think MPI_Allgatherv is particularly optimized (since
> it is hard to do well and not a very popular function), so replacing it might
> improve your memory utilization.
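>
> Roughly something like this (an untested sketch; the buffer and count names
> are placeholders, and I'm assuming every rank already knows the counts and
> displacements, just as it would for MPI_Allgatherv):
>
> #include <mpi.h>
>
> /* Stand-in for MPI_Allgatherv over one shared buffer: each rank's chunk
>  * lives at buf + displs[rank]; gather everything to rank 0 with
>  * point-to-point messages, then broadcast the full buffer to everyone. */
> static void allgatherv_via_bcast(float *buf, const int counts[],
>                                  const int displs[], MPI_Comm comm)
> {
>     int nranks, rank;
>     MPI_Comm_size(comm, &nranks);
>     MPI_Comm_rank(comm, &rank);
>
>     if (rank == 0) {
>         /* rank 0's own chunk is already in place */
>         for (int r = 1; r < nranks; r++)
>             MPI_Recv(buf + displs[r], counts[r], MPI_FLOAT, r, 0, comm,
>                      MPI_STATUS_IGNORE);
>     } else {
>         MPI_Send(buf + displs[rank], counts[rank], MPI_FLOAT, 0, 0, comm);
>     }
>
>     /* assumes displs[] is increasing, so this is the total element count */
>     MPI_Bcast(buf, displs[nranks - 1] + counts[nranks - 1], MPI_FLOAT,
>               0, comm);
> }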
>
> Jeff
>
> On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester <op8...@gmail.com> wrote:
>>
>> Gilles,
>>
>> It is btl/tcp (we'll be upgrading to newer EC2 instance types next year to
>> take advantage of libfabric).  I need to write a script to log and timestamp
>> the memory usage of the process as reported by /proc/<pid>/stat and sync that
>> up with the application's log of what it's doing before I can say this
>> definitively, but based on what I've watched in 'top' so far, I think the big
>> allocations are happening in two areas where I call MPI_Allgatherv().  Every
>> rank has roughly 1/numRanks of the data (but not divided exactly evenly,
>> hence MPI_Allgatherv).  The ranks reuse a pre-allocated buffer to store their
>> local results and then pass that same pre-allocated buffer into
>> MPI_Allgatherv() to bring in results from all ranks (roughly the pattern
>> sketched below), so there is a lot of communication across all ranks at these
>> points.  Does your comment about using the coll/sync module apply in this
>> case?  I'm not familiar with that module - is it something I specify when
>> compiling OpenMPI, or a runtime option that I enable?
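>>
>> For reference, the pattern is roughly this (a simplified sketch; the buffer
>> and count names are made up, the real code isn't float-specific, and I'm
>> showing the in-place form since the send and receive share one buffer):
>>
>> #include <mpi.h>
>>
>> /* Each rank has already written its local results into its own slice
>>  * (at its own displacement) of one big pre-allocated buffer; the in-place
>>  * allgatherv then fills in every other rank's slice. */
>> static void gather_results(float *buf, const int counts[],
>>                            const int displs[], MPI_Comm comm)
>> {
>>     MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
>>                    buf, counts, displs, MPI_FLOAT, comm);
>> }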
>>
>> Thanks for the detailed help.
>> -Adam
>>
>> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Adam,
>>>
>>> Are you using btl/tcp (i.e. plain TCP/IP) for inter-node communications,
>>> or are you using libfabric on top of the latest EC2 drivers?
>>>
>>> There is no flow control in btl/tcp, which means that if, for example,
>>> all your nodes send messages to rank 0, that can create a lot of
>>> unexpected messages on that rank.
>>> In the case of btl/tcp, this means a lot of malloc() calls on rank 0
>>> until these messages are received by the app.
>>> If rank 0 is overwhelmed, the node will likely end up swapping to death
>>> (or your app will be killed if you have little or no swap).
>>>
>>> If you are using collective operations, make sure the coll/sync module
>>> is selected.
>>> This module inserts an MPI_Barrier() every n collectives on a given
>>> communicator. This forces your processes to synchronize and can force
>>> messages to be received. (Think of the previous example if you run
>>> MPI_Scatter(root=0) in a loop.)
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
>>> >
>>> > This case is actually quite small - 10 physical machines with 18 physical 
>>> > cores each, 1 rank per machine.  These are AWS R4 instances (Intel Xeon 
>>> > E5 Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).
>>> >
>>> > I calculate the memory needs of my application up front (in this case ~225
>>> > GB per machine), allocate one buffer up front, and reuse this buffer for
>>> > valid data and scratch space throughout processing.  This is running on
>>> > RHEL 7 - I'm measuring memory usage via top, where I see it go up to 248 GB
>>> > in an MPI-intensive portion of processing.
>>> >
>>> > I thought I was being quite careful with my memory allocations and there 
>>> > weren't any other stray allocations going on, but of course it's possible 
>>> > there's a large temp buffer somewhere that I've missed... based on what 
>>> > you're saying, this is way more memory than should be attributed to 
>>> > OpenMPI - is there a way I can query OpenMPI to confirm that?  If the OS 
>>> > is unable to keep up with the network traffic, is it possible there's 
>>> > some low-level system buffer that gets allocated to gradually work off 
>>> > the TCP traffic?
>>> >
>>> > Thanks.
>>> >
>>> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users 
>>> > <users@lists.open-mpi.org> wrote:
>>> >>
>>> >> How many nodes are you using? How many processes per node? What kind of 
>>> >> processor? Open MPI version? 25 GB is several orders of magnitude more 
>>> >> memory than should be used except at extreme scale (1M+ processes). 
>>> >> Also, how are you calculating memory usage?
>>> >>
>>> >> -Nathan
>>> >>
>>> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>> >> >
>>> >> > Is there a way at runtime to query OpenMPI to ask it how much memory 
>>> >> > it's using for internal buffers?  Is there a way at runtime to set a 
>>> >> > max amount of memory OpenMPI will use for these buffers?  I have an 
>>> >> > application where for certain inputs OpenMPI appears to be allocating 
>>> >> > ~25 GB and I'm not accounting for this in my memory calculations (and 
>>> >> > thus bricking the machine).
>>> >> >
>>> >> > Thanks.
>>> >> > -Adam
>>> >>
>>> >
>>
>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
