Adam,

You can rewrite MPI_Allgatherv() in your app. It should simply invoke PMPI_Allgatherv() (note the leading 'P') with the same arguments, followed by MPI_Barrier() on the same communicator (feel free to also call MPI_Barrier() before PMPI_Allgatherv()). That can make your code slower, but it will force the unexpected messages related to the allgatherv to be received. If that helps with respect to memory consumption, it means we have a lead.
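As a minimal sketch (untested - the prototype below is the MPI-3 one, adjust it to whatever your MPI version declares), the wrapper could look like:

    #include <mpi.h>

    /* Profiling-interface override: the linker resolves MPI_Allgatherv
     * to this definition in the app, while PMPI_Allgatherv still calls
     * the real implementation inside the MPI library. */
    int MPI_Allgatherv(const void *sendbuf, int sendcount,
                       MPI_Datatype sendtype, void *recvbuf,
                       const int recvcounts[], const int displs[],
                       MPI_Datatype recvtype, MPI_Comm comm)
    {
        int rc;

        MPI_Barrier(comm);   /* optional: synchronize before as well */
        rc = PMPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf,
                             recvcounts, displs, recvtype, comm);
        MPI_Barrier(comm);   /* force the allgatherv traffic to be drained
                              * before any rank races ahead */
        return rc;
    }

Since the symbol is defined in your application, it takes precedence over the library's MPI_Allgatherv; every other MPI call is left untouched.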
Cheers,

Gilles

On Fri, Dec 21, 2018 at 5:00 AM Jeff Hammond <jeff.scie...@gmail.com> wrote:
>
> You might try replacing MPI_Allgatherv with the equivalent Send+Recv
> followed by Broadcast. I don't think MPI_Allgatherv is particularly
> optimized (since it is hard to do and not a very popular function), and the
> substitution might improve your memory utilization.
>
> Jeff
>
> On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester <op8...@gmail.com> wrote:
>>
>> Gilles,
>>
>> It is btl/tcp (we'll be upgrading to newer EC2 instance types next year to
>> take advantage of libfabric). I need to write a script that logs and
>> timestamps the memory usage of the process as reported by /proc/<pid>/stat
>> and syncs that up with the application's log of what it's doing before I
>> can say this definitively, but based on what I've watched in 'top' so far,
>> I think the big allocations happen in two areas where I call
>> MPI_Allgatherv(). Every rank has roughly 1/numRanks of the data, but it is
>> not divided exactly evenly, so I need MPI_Allgatherv. Each rank reuses its
>> pre-allocated buffer to store its local results and then passes that same
>> buffer into MPI_Allgatherv() to bring in results from all ranks. So there
>> is a lot of communication across all ranks at these points. Does your
>> comment about using the coll/sync module apply in this case? I'm not
>> familiar with this module - is it something I specify at Open MPI compile
>> time or a runtime option that I enable?
>>
>> Thanks for the detailed help.
>> -Adam
>>
>> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Adam,
>>>
>>> Are you using btl/tcp (i.e. plain TCP/IP) for inter-node communications,
>>> or are you using libfabric on top of the latest EC2 drivers?
>>>
>>> There is no flow control in btl/tcp, which means, for example, that if
>>> all your nodes send messages to rank 0, a lot of unexpected messages can
>>> pile up on that rank.
>>> In the case of btl/tcp, this means a lot of malloc() on rank 0 until the
>>> app receives these messages.
>>> If rank 0 is overwhelmed, the node will likely end up swapping to death
>>> (or your app will be killed if you have little or no swap).
>>>
>>> If you are using collective operations, make sure the coll/sync module is
>>> selected.
>>> This module inserts MPI_Barrier() every n collectives on a given
>>> communicator. That forces your processes to synchronize and can force
>>> messages to be received. (Think of the previous example if you run
>>> MPI_Scatter(root=0) in a loop.)
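>>>
>>> Assuming your Open MPI build ships the coll/sync component, you can
>>> enable and tune it at run time along these lines (the parameter names
>>> here are from memory - double-check them with ompi_info against your
>>> install):
>>>
>>>     # list the component's parameters
>>>     ompi_info --param coll sync --level 9
>>>
>>>     # raise coll/sync's priority so it wraps the other coll components,
>>>     # and inject an MPI_Barrier() every 100 collectives
>>>     mpirun --mca coll_sync_priority 100 \
>>>            --mca coll_sync_barrier_after 100 ./your_app
>>>
>>> A smaller barrier interval trades speed for a tighter bound on the
>>> number of outstanding unexpected messages.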
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
>>> >
>>> > This case is actually quite small - 10 physical machines with 18
>>> > physical cores each, 1 rank per machine. These are AWS R4 instances
>>> > (Intel Xeon E5 Broadwell processors), Open MPI version 2.1.0, using TCP
>>> > (10 Gbps).
>>> >
>>> > I calculate the memory needs of my application up front (in this case
>>> > ~225 GB per machine), allocate one buffer up front, and reuse that
>>> > buffer for valid data and scratch space throughout processing. This is
>>> > running on RHEL 7 - I'm measuring memory usage via top, where I see it
>>> > go up to 248 GB in an MPI-intensive portion of processing.
>>> >
>>> > I thought I was being quite careful with my memory allocations and that
>>> > there weren't any other stray allocations going on, but of course it's
>>> > possible there's a large temp buffer somewhere that I've missed...
>>> > based on what you're saying, this is way more memory than should be
>>> > attributed to Open MPI - is there a way I can query Open MPI to confirm
>>> > that? If the OS is unable to keep up with the network traffic, is it
>>> > possible there's some low-level system buffer that gets allocated to
>>> > gradually work off the TCP traffic?
>>> >
>>> > Thanks.
>>> >
>>> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users
>>> > <users@lists.open-mpi.org> wrote:
>>> >>
>>> >> How many nodes are you using? How many processes per node? What kind
>>> >> of processor? Which Open MPI version? 25 GB is several orders of
>>> >> magnitude more memory than should be used except at extreme scale
>>> >> (1M+ processes). Also, how are you calculating memory usage?
>>> >>
>>> >> -Nathan
>>> >>
>>> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>> >> >
>>> >> > Is there a way at runtime to query Open MPI to ask how much memory
>>> >> > it's using for internal buffers? Is there a way at runtime to set a
>>> >> > maximum amount of memory Open MPI will use for these buffers? I have
>>> >> > an application where, for certain inputs, Open MPI appears to be
>>> >> > allocating ~25 GB that I'm not accounting for in my memory
>>> >> > calculations (and thus bricking the machine).
>>> >> >
>>> >> > Thanks.
>>> >> > -Adam
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users