You might try replacing MPI_Allgatherv with the equivalent Send+Recv
followed by Broadcast.  I don't think MPI_Allgatherv is particularly
optimized (since it is hard to do and not a very popular function), and the
replacement might improve your memory utilization.
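
Roughly, the replacement would look something like this untested sketch.  It
reuses the counts/displs arrays you already compute for MPI_Allgatherv; the
names and the float datatype are illustrative:

#include <mpi.h>

/* Point-to-point sends to rank 0 followed by a broadcast, in place of
 * MPI_Allgatherv.  buf already holds each rank's own chunk at
 * buf + displs[rank].  Note that for data this large the int counts can
 * overflow, so the transfers may need to be chunked. */
void gather_then_bcast(float *buf, const int *counts, const int *displs,
                       int rank, int nranks, MPI_Comm comm)
{
    if (rank == 0) {
        /* Rank 0's chunk is already in place; receive everyone else's
         * directly into position. */
        for (int r = 1; r < nranks; ++r)
            MPI_Recv(buf + displs[r], counts[r], MPI_FLOAT, r, 0, comm,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf + displs[rank], counts[rank], MPI_FLOAT, 0, 0, comm);
    }

    /* Total length = offset of the last chunk plus its count. */
    int total = displs[nranks - 1] + counts[nranks - 1];
    MPI_Bcast(buf, total, MPI_FLOAT, 0, comm);
}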

Jeff

On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester <op8...@gmail.com> wrote:

> Gilles,
>
> It is btl/tcp (we'll be upgrading to newer EC2 instance types next year to
> take advantage of libfabric).  I still need to write a script that logs and
> timestamps the memory usage of the process as reported by /proc/<pid>/stat
> and syncs that up with the application's log of what it's doing before I
> can say this definitively, but based on what I've watched in 'top' so far,
> I think the big allocations happen in two places where I call
> MPI_Allgatherv().  Every rank has roughly 1/numRanks of the data (but not
> divided exactly evenly, hence MPI_Allgatherv); each rank reuses its
> pre-allocated buffer to store its local results and then passes that same
> buffer into MPI_Allgatherv() to bring in the results from all ranks, so
> there is a lot of communication across all ranks at these points.  Does
> your comment about using the coll/sync module apply in this case?  I'm not
> familiar with that module - is it something I specify at Open MPI compile
> time, or a runtime option that I enable?
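>
> In code, that pattern boils down to roughly the following (simplified
> sketch, illustrative names; MPI_IN_PLACE is one way to express reusing the
> same buffer as both source and destination):
>
> #include <mpi.h>
>
> /* Every rank's local result already sits at offset displs[rank] inside the
>  * one big pre-allocated buffer, so the send arguments are ignored. */
> void gather_results(float *bigBuffer, const int *counts, const int *displs,
>                     MPI_Comm comm)
> {
>     /* counts[r] / displs[r]: element count and offset of rank r's chunk,
>      * computed from the (slightly uneven) data split. */
>     MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
>                    bigBuffer, counts, displs, MPI_FLOAT, comm);
> }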
>
> Thanks for the detailed help.
> -Adam
>
> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Adam,
>>
>> Are you using btl/tcp (i.e. plain TCP/IP) for inter-node communications?
>> Or are you using libfabric on top of the latest EC2 drivers?
>>
>> There is no flow control in btl/tcp, which means, for example, that if all
>> your nodes send messages to rank 0, this can create a lot of unexpected
>> messages on that rank.
>> In the case of btl/tcp, this means a lot of malloc() on rank 0 until
>> these messages are received by the app.
>> If rank 0 is overwhelmed, that will likely end up with the node
>> swapping to death (or your app being killed if you have little or no swap).
>>
>> If you are using collective operations, make sure the coll/sync module
>> is selected.
>> This module inserts an MPI_Barrier() every n collectives on a given
>> communicator.  This forces your processes to synchronize and can force
>> messages to be received (think of the previous example if you run
>> MPI_Scatter(root=0) in a loop).
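>>
>> By hand, the equivalent would look roughly like this (illustrative sketch;
>> the value of n and the buffers are made up):
>>
>> #include <mpi.h>
>>
>> /* Hand-rolled version of what coll/sync does: a barrier every n
>>  * collectives, so no rank can race ahead and flood the root with
>>  * unexpected messages. */
>> void scatter_loop(const float *sendbuf, float *recvbuf, int count,
>>                   int niters, MPI_Comm comm)
>> {
>>     const int SYNC_EVERY = 100;          /* illustrative value of "n" */
>>     for (int i = 0; i < niters; ++i) {
>>         MPI_Scatter(sendbuf, count, MPI_FLOAT,
>>                     recvbuf, count, MPI_FLOAT, 0 /* root */, comm);
>>         if ((i + 1) % SYNC_EVERY == 0)
>>             MPI_Barrier(comm);
>>     }
>> }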
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
>> >
>> > This case is actually quite small - 10 physical machines with 18
>> > physical cores each, 1 rank per machine.  These are AWS R4 instances
>> > (Intel Xeon E5 Broadwell processors).  Open MPI version 2.1.0, using
>> > TCP (10 Gbps).
>> >
>> > I calculate the memory needs of my application upfront (in this case
>> > ~225 GB per machine), allocate one buffer upfront, and reuse this buffer
>> > for valid data and scratch space throughout processing.  This is running
>> > on RHEL 7 - I'm measuring memory usage via top, where I see it go up to
>> > 248 GB in an MPI-intensive portion of processing.
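>> >
>> > For reference, the number top reports can also be sampled
>> > programmatically for timestamped logging - a minimal sketch
>> > (Linux-specific, reading VmRSS from /proc/self/status):
>> >
>> > #include <stdio.h>
>> > #include <string.h>
>> >
>> > /* Return the process resident set size in MiB, or -1 on failure. */
>> > static long rss_mib(void)
>> > {
>> >     FILE *f = fopen("/proc/self/status", "r");
>> >     if (!f) return -1;
>> >     char line[256];
>> >     long kib = -1;
>> >     while (fgets(line, sizeof(line), f)) {
>> >         if (strncmp(line, "VmRSS:", 6) == 0) {
>> >             sscanf(line + 6, "%ld", &kib);   /* value is reported in kB */
>> >             break;
>> >         }
>> >     }
>> >     fclose(f);
>> >     return kib < 0 ? -1 : kib / 1024;
>> > }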
>> >
>> > I thought I was being quite careful with my memory allocations and that
>> > there weren't any other stray allocations going on, but of course it's
>> > possible there's a large temp buffer somewhere that I've missed... based
>> > on what you're saying, this is way more memory than should be attributed
>> > to Open MPI - is there a way I can query Open MPI to confirm that?  If
>> > the OS is unable to keep up with the network traffic, is it possible
>> > there's some low-level system buffer that gets allocated to gradually
>> > work off the TCP traffic?
>> >
>> > Thanks.
>> >
>> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <
>> users@lists.open-mpi.org> wrote:
>> >>
>> >> How many nodes are you using? How many processes per node? What kind
>> of processor? Open MPI version? 25 GB is several orders of magnitude more
>> memory than should be used except at extreme scale (1M+ processes). Also,
>> how are you calculating memory usage?
>> >>
>> >> -Nathan
>> >>
>> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com>
>> wrote:
>> >> >
>> >> > Is there a way at runtime to query OpenMPI to ask it how much memory
>> it's using for internal buffers?  Is there a way at runtime to set a max
>> amount of memory OpenMPI will use for these buffers?  I have an application
>> where for certain inputs OpenMPI appears to be allocating ~25 GB and I'm
>> not accounting for this in my memory calculations (and thus bricking the
>> machine).
>> >> >
>> >> > Thanks.
>> >> > -Adam



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/