You might try replacing MPI_Allgatherv with an equivalent Send+Recv followed by a Broadcast. I don't think MPI_Allgatherv is particularly optimized (it is hard to optimize and not a very popular function), so the substitution might improve your memory utilization.

Jeff
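A minimal sketch of that substitution: gather to rank 0 with point-to-point messages, then broadcast the assembled buffer. The function name, the float payload, and the counts/displs arrays are illustrative, not from the thread; counts/displs mirror what MPI_Allgatherv itself would take.

    #include <mpi.h>

    /* Emulate MPI_Allgatherv: rank 0 collects every rank's block via
     * point-to-point messages, then broadcasts the full buffer.
     * counts[r]/displs[r] describe rank r's block, as in MPI_Allgatherv.
     * Note: MPI counts are int, so very large buffers need chunking. */
    void allgatherv_emulated(float *buf, const int *counts,
                             const int *displs, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == 0) {
            /* Receive each remote block directly into place. */
            for (int r = 1; r < size; r++)
                MPI_Recv(buf + displs[r], counts[r], MPI_FLOAT,
                         r, 0, comm, MPI_STATUS_IGNORE);
        } else {
            /* This rank's own data already sits at buf + displs[rank]. */
            MPI_Send(buf + displs[rank], counts[rank], MPI_FLOAT,
                     0, 0, comm);
        }

        /* Total length = offset of the last block plus its count. */
        MPI_Bcast(buf, displs[size - 1] + counts[size - 1],
                  MPI_FLOAT, 0, comm);
    }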
On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester <op8...@gmail.com> wrote:
> Gilles,
>
> It is btl/tcp (we'll be upgrading to newer EC2 instance types next year
> to take advantage of libfabric). I need to write a script to log and
> timestamp the memory usage of the process as reported by /proc/<pid>/stat
> and sync that up with the application's log of what it's doing before I
> can say this definitively, but based on what I've watched in 'top' so
> far, I think these big allocations happen in two areas where I call
> MPI_Allgatherv(): every rank has roughly 1/numRanks of the data (but not
> divided exactly evenly, hence MPI_Allgatherv). The ranks reuse a
> pre-allocated buffer to store their local results and then pass that same
> buffer into MPI_Allgatherv() to bring in results from all ranks, so there
> is a lot of communication across all ranks at these points. Does your
> comment about using the coll/sync module apply in this case? I'm not
> familiar with that module - is it something I specify at Open MPI compile
> time, or a runtime option that I enable?
>
> Thanks for the detailed help.
> -Adam
>
> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>> Adam,
>>
>> Are you using btl/tcp (e.g. plain TCP/IP) for inter-node communications,
>> or are you using libfabric on top of the latest EC2 drivers?
>>
>> There is no flow control in btl/tcp, which means, for example, that if
>> all your nodes send messages to rank 0, this can create a lot of
>> unexpected messages on that rank. In the case of btl/tcp, that means a
>> lot of malloc() on rank 0 until these messages are received by the app.
>> If rank 0 is overwhelmed, the node will likely end up swapping to death
>> (or your app will be killed if you have little or no swap).
>>
>> If you are using collective operations, make sure the coll/sync module
>> is selected. This module inserts an MPI_Barrier() every n collectives
>> on a given communicator, which forces your processes to synchronize and
>> can force messages to be received. (Think of the previous example if
>> you run MPI_Scatter(root=0) in a loop.)
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
>>>
>>> This case is actually quite small - 10 physical machines with 18
>>> physical cores each, 1 rank per machine. These are AWS R4 instances
>>> (Intel Xeon E5 Broadwell processors), Open MPI version 2.1.0, using
>>> TCP (10 Gbps).
>>>
>>> I calculate the memory needs of my application upfront (in this case
>>> ~225 GB per machine), allocate one buffer upfront, and reuse this
>>> buffer for valid and scratch data throughout processing. This is
>>> running on RHEL 7 - I'm measuring memory usage via top, where I see it
>>> go up to 248 GB in an MPI-intensive portion of processing.
>>>
>>> I thought I was being quite careful with my memory allocations and
>>> that there weren't any other stray allocations going on, but of course
>>> it's possible there's a large temp buffer somewhere that I've
>>> missed... based on what you're saying, this is way more memory than
>>> should be attributed to Open MPI - is there a way I can query Open MPI
>>> to confirm that? If the OS is unable to keep up with the network
>>> traffic, is it possible there's some low-level system buffer that gets
>>> allocated to gradually work off the TCP traffic?
>>>
>>> Thanks.
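For reference, the pattern Adam describes above - an uneven 1/numRanks split gathered in place into one pre-allocated buffer - might look like the following sketch. The helper name, the float type, and the remainder-spreading scheme are assumptions for illustration, not taken from his code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Gather every rank's uneven share of `total` floats into the shared
     * pre-allocated buffer.  Each rank is assumed to have already written
     * its local results into its own region of buf, so MPI_IN_PLACE
     * avoids a second copy. */
    void gather_all(float *buf, long total, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        int *counts = malloc(size * sizeof *counts);
        int *displs = malloc(size * sizeof *displs);

        /* total does not divide evenly; spread the remainder over the
         * first (total % size) ranks. */
        long base = total / size, rem = total % size, offset = 0;
        for (int r = 0; r < size; r++) {
            counts[r] = (int)(base + (r < rem ? 1 : 0));
            displs[r] = (int)offset;
            offset += counts[r];
        }

        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       buf, counts, displs, MPI_FLOAT, comm);

        free(counts);
        free(displs);
    }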
>>>
>>> On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users
>>> <firstname.lastname@example.org> wrote:
>>>> How many nodes are you using? How many processes per node? What kind
>>>> of processor? Open MPI version? 25 GB is several orders of magnitude
>>>> more memory than should be used except at extreme scale (1M+
>>>> processes). Also, how are you calculating memory usage?
>>>>
>>>> -Nathan
>>>>
>>>>> On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>>
>>>>> Is there a way at runtime to query Open MPI to ask it how much
>>>>> memory it's using for internal buffers? Is there a way at runtime to
>>>>> set a max amount of memory Open MPI will use for these buffers? I
>>>>> have an application where, for certain inputs, Open MPI appears to
>>>>> be allocating ~25 GB, and I'm not accounting for this in my memory
>>>>> calculations (and thus bricking the machine).
>>>>>
>>>>> Thanks.
>>>>> -Adam

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
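A sketch of the /proc/<pid>/stat logging Adam mentions wanting to script, for syncing with an application log. Field positions follow proc(5): vsize is field 23 (bytes) and rss is field 24 (pages). The helper name and sampling of the calling process are illustrative; a standalone monitor would read an arbitrary pid on a timer instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* Print a Unix timestamp plus this process's vsize and rss, read
     * from /proc/self/stat. */
    void log_memory_usage(void)
    {
        FILE *f = fopen("/proc/self/stat", "r");
        if (!f)
            return;

        char line[4096];
        if (fgets(line, sizeof line, f)) {
            /* comm (field 2) may contain spaces, so start scanning after
             * the ')' that terminates it; the next token is field 3. */
            char *p = strrchr(line, ')');
            unsigned long long vsize = 0, rss = 0;
            int field = 2;
            for (char *tok = p ? strtok(p + 2, " ") : NULL; tok;
                 tok = strtok(NULL, " ")) {
                field++;
                if (field == 23) vsize = strtoull(tok, NULL, 10);
                if (field == 24) rss   = strtoull(tok, NULL, 10);
            }
            printf("%ld vsize=%llu rss=%llu (bytes)\n", (long)time(NULL),
                   vsize, rss * (unsigned long long)sysconf(_SC_PAGESIZE));
        }
        fclose(f);
    }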