Joseph,

Thanks for sharing this!
sysv is imho the worst option because, if something goes really wrong, Open MPI might leave some shared memory segments behind when a job crashes. From that perspective, leaving a big file in /tmp can be seen as the lesser evil. That being said, there might be other reasons that drove this design.

Cheers,

Gilles

Joseph Schuchart <schuch...@hlrs.de> wrote:

> We are currently discussing internally how to proceed with this issue on our machine. We did a little survey to see the setup of some of the machines we have access to, which includes an IBM machine, a Bull machine, and two Cray XC40 machines. To summarize our findings:
>
> 1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and each limited to half of the main memory size per node.
> 2) On the IBM system, nodes have 64GB; /tmp is limited to 20GB and mounted from a disk partition, while /dev/shm is sized at 63GB.
> 3) On the above systems, /proc/sys/kernel/shm* is set up to allow the full memory of the node to be used as System V shared memory.
> 4) On the Bull machine, /tmp is mounted from a disk and fixed to ~100GB, while /dev/shm is limited to half the node's memory (there are nodes with 2TB of memory, and huge page support is available). System V shmem, on the other hand, is limited to 4GB.
>
> Overall, it seems that there is no globally optimal allocation strategy: the best-matching source of shared memory is machine dependent.
>
> Open MPI treats System V shared memory as the least favorable option, even giving it a lower priority than POSIX shared memory, where conflicting names might occur. What's the reason for preferring /tmp and POSIX shared memory over System V? It seems to me that the latter is a cleaner and safer way (provided that shared memory is not constrained by /proc, which could easily be detected), while mmap'ing large files feels somewhat hacky. Maybe I am missing an important aspect here, though.
>
> The reason I am interested in this issue is that our PGAS library is built on top of MPI and allocates pretty much all memory exposed to the user through MPI windows. Thus, any limitation from the underlying MPI implementation (or system, for that matter) limits the amount of usable memory for our users.
>
> Given our observations above, I would like to propose a change to the shared memory allocator: the priorities would be derived from the percentage of main memory each component can cover, i.e.,
>
> Priority = 99 * (min(Memory, SpaceAvail) / Memory)
>
> At startup, each shm component would determine the available size (by looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*, respectively) and set its priority between 0 and 99. A user could force Open MPI to use a specific component by manually setting its priority to 100 (which of course has to be documented). The priority could factor in other aspects as well, such as whether /tmp is actually tmpfs or disk-based, if that makes a difference in performance.
>
> This proposal of course assumes that shared memory size is the sole optimization goal. Maybe there are other aspects to consider? I'd be happy to work on a patch but would like to get some feedback before getting my hands dirty. IMO, the current situation is less than ideal and prone to cause pain to the average user. In my recent experience, debugging this has been tedious, and the user in general shouldn't have to care about how shared memory is allocated (and administrators don't always seem to care, see above).
>
> Any feedback is highly appreciated.
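To make the proposed priority calculation concrete, a minimal sketch could look like the following. This is an illustration only, not code from an actual patch: the function names are made up, the file-backed components are probed with statvfs()/sysconf(), and a System V component would instead parse /proc/sys/kernel/shm*.

```
/* Illustrative sketch of the proposed priority computation
 * (hypothetical helper names, not actual Open MPI code). */
#include <stdint.h>
#include <unistd.h>
#include <sys/statvfs.h>

/* Total physical memory of the node, in bytes. */
static uint64_t node_memory_bytes(void)
{
    return (uint64_t)sysconf(_SC_PHYS_PAGES) * (uint64_t)sysconf(_SC_PAGESIZE);
}

/* Space available to a file-backed component, e.g. path = "/tmp" or "/dev/shm". */
static uint64_t space_available_bytes(const char *path)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) {
        return 0;
    }
    return (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
}

/* Priority = 99 * (min(Memory, SpaceAvail) / Memory), i.e. 0..99. */
static int shmem_component_priority(uint64_t space_avail)
{
    uint64_t mem    = node_memory_bytes();
    uint64_t usable = space_avail < mem ? space_avail : mem;
    return (int)((99.0 * (double)usable) / (double)mem);
}
```

A component that cannot back any memory would thus report priority 0 and one that can cover the full node memory would report 99, while a user-forced priority of 100 would stay outside this range, as proposed above.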
>
> Joseph
>
> On 09/04/2017 03:13 PM, Joseph Schuchart wrote:
>> Jeff, all,
>>
>> Unfortunately, I (as a user) have no control over the page size on our cluster. My interest in this is more of a general nature: I am concerned that our users, who use Open MPI underneath our code, run into this issue on their machines.
>>
>> I took a look at the code for the various window creation methods and now have a better picture of the allocation process in Open MPI. I realized that memory in windows allocated through MPI_Win_allocate or created through MPI_Win_create is registered with the IB device using ibv_reg_mr, which takes significant time for large allocations (I assume this is where hugepages would help?). In contrast to this, it seems that memory attached through MPI_Win_attach is not registered, which explains the lower latency I am observing for these allocations (I seem to remember having observed higher communication latencies as well).
>>
>> Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix component that uses shm_open to create a POSIX shared memory object instead of a file on disk, which is then mmap'ed. Unfortunately, if I raise the priority of this component above that of the default mmap component, I end up with a SIGBUS during MPI_Init. No other errors are reported by MPI. Should I open a ticket on GitHub for this?
>>
>> As an alternative, would it be possible to use anonymous shared memory mappings to avoid the backing file for large allocations (maybe above a certain threshold) on systems that support MAP_ANONYMOUS, and distribute the result of the mmap call among the processes on the node?
>>
>> Thanks,
>> Joseph
>>
>> On 08/29/2017 06:12 PM, Jeff Hammond wrote:
>>> I don't know any reason why you shouldn't be able to use IB for intra-node transfers. There are, of course, arguments against doing it in general (e.g. IB/PCI bandwidth is less than DDR4 bandwidth), but it likely behaves less synchronously than shared memory, since I'm not aware of any MPI RMA library that dispatches the intra-node RMA operations to an asynchronous agent (e.g. a communication helper thread).
>>>
>>> Regarding 4, faulting 100GB in 24s corresponds to 1us per 4K page, which doesn't sound unreasonable to me. You might investigate if/how you can use 2M or 1G pages instead. It's possible Open MPI already supports this, if the underlying system does. You may need to twiddle your OS settings to get hugetlbfs working.
>>>
>>> Jeff
>>>
>>> On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> Jeff, all,
>>>
>>> Thanks for the clarification. My measurements show that global memory allocations do not require the backing file if there is only one process per node, for an arbitrary number of processes. So I was wondering if it was possible to use the same allocation process even with multiple processes per node if there is not enough space available in /tmp. However, I am not sure whether the IB devices can be used to perform intra-node RMA. At least that would retain the functionality on this kind of system (which arguably might be a rare case).
>>>
>>> On a different note, I found during the weekend that Valgrind only supports allocations up to 60GB, so my second point reported below may be invalid. Number 4 still seems curious to me, though.
>>>
>>> Best
>>> Joseph
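For reference, the POSIX path mentioned above (opal/mca/shmem/posix) boils down to shm_open/ftruncate/mmap on a shared-memory object instead of a file in /tmp. The following is a minimal, self-contained sketch of that pattern, not Open MPI's actual component code; the object name and size are placeholders.

```
/* Minimal sketch of the POSIX shared-memory pattern (shm_open + ftruncate +
 * mmap); an illustration only, not the opal/mca/shmem/posix implementation. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char  *name = "/my_shm_example";   /* placeholder object name */
    const size_t size = 1UL << 30;           /* 1 GiB, adjust as needed */

    /* Create the shared memory object; on Linux it lives in tmpfs
     * (/dev/shm), so it is not constrained by the size of /tmp. */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return EXIT_FAILURE; }

    /* Size the object; tmpfs pages are allocated lazily on first touch. */
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return EXIT_FAILURE; }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    memset(base, 0, size);                   /* touch the memory */
    printf("mapped %zu bytes at %p\n", size, base);

    munmap(base, size);
    close(fd);
    shm_unlink(name);                        /* clean up the object */
    return EXIT_SUCCESS;
}
```

Other processes on the node would shm_open() the same name and mmap() it to share the segment; on some systems the example needs to be linked with -lrt.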
>>>
>>> On 08/25/2017 09:17 PM, Jeff Hammond wrote:
>>>
>>> There's no reason to do anything special for shared memory with a single-process job, because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). However, it would help debugging if MPI implementers at least had an option to take the code path that allocates shared memory even when np=1.
>>>
>>> Jeff
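To make the np=1 comparison above concrete, here is a small sketch using only standard MPI-3 calls (the 1 GiB size is just a placeholder):

```
/* Comparing the two single-process allocation paths mentioned above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint size       = (MPI_Aint)1 << 30;  /* 1 GiB, adjust as needed */
    void    *buf_alloc  = NULL;
    void    *buf_shared = NULL;
    MPI_Win  win;

    /* Plain allocation: no shared-memory backing required. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf_alloc);

    /* Shared-memory window on a single-process communicator; whether the
     * implementation takes the shared-memory code path here is exactly the
     * choice discussed above. */
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &buf_shared, &win);

    printf("MPI_Alloc_mem=%p MPI_Win_allocate_shared=%p\n",
           buf_alloc, buf_shared);

    MPI_Win_free(&win);
    MPI_Free_mem(buf_alloc);
    MPI_Finalize();
    return 0;
}
```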
>>>
>>> On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> Gilles,
>>>
>>> Thanks for your swift response. On this system, /dev/shm only has 256M available, so that is no option, unfortunately. I tried disabling both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still seems to allocate the shmem backing file under /tmp. From my point of view, missing out on the performance benefits of file-backed shared memory would be acceptable as long as large allocations work, but I don't know the implementation details and whether that is possible. It seems that the mmap does not happen if there is only one process per node.
>>>
>>> Cheers,
>>> Joseph
>>>
>>> On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>>>
>>> Joseph,
>>>
>>> The error message suggests that allocating memory with MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing it. How much space do you have in /dev/shm? (This is a tmpfs, i.e. a RAM file system.) There is likely quite some space there, so as a workaround, I suggest you use this as the shared-memory backing directory.
>>>
>>> /* i am afk and do not remember the syntax, ompi_info --all | grep backing is likely to help */
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> All,
>>>
>>> I have been experimenting with large window allocations recently and have made some interesting observations that I would like to share.
>>>
>>> The system under test:
>>> - Linux cluster equipped with IB
>>> - Open MPI 2.1.1
>>> - 128GB main memory per node
>>> - 6GB /tmp filesystem per node
>>>
>>> My observations:
>>>
>>> 1) Running with 1 process on a single node, I can allocate and write to memory up to ~110GB through MPI_Allocate, MPI_Win_allocate, and MPI_Win_allocate_shared.
>>>
>>> 2) If running with 1 process per node on 2 nodes, single large allocations succeed, but with the repeating allocate/free cycle in the attached code I see the application reproducibly being killed by the OOM killer at a 25GB allocation with MPI_Win_allocate_shared. When I try to run it under Valgrind, I get an error from MPI_Win_allocate at ~50GB that I cannot make sense of:
>>>
>>> ```
>>> MPI_Alloc_mem: 53687091200 B
>>> [n131302:11989] *** An error occurred in MPI_Alloc_mem
>>> [n131302:11989] *** reported by process [1567293441,1]
>>> [n131302:11989] *** on communicator MPI_COMM_WORLD
>>> [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>> [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [n131302:11989] *** and potentially your MPI job)
>>> ```
>>>
>>> 3) If running with 2 processes on a node, I get the following error from both MPI_Win_allocate and MPI_Win_allocate_shared:
>>>
>>> ```
>>> --------------------------------------------------------------------------
>>> It appears as if there is not enough space for
>>> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>> (the shared-memory backing file). It is likely that your MPI job will
>>> now either abort or experience performance degradation.
>>>
>>> Local host:      n131702
>>> Space Requested: 6710890760 B
>>> Space Available: 6433673216 B
>>> ```
>>>
>>> This seems to be related to the size limit of /tmp. MPI_Allocate works as expected, i.e., I can allocate ~50GB per process. I understand that I can set $TMP to a bigger filesystem (such as Lustre), but then I am greeted with a warning on each allocation and performance seems to drop. Is there a way to fall back to the allocation strategy used in case 2)?
>>>
>>> 4) It is also worth noting the time it takes to allocate the memory: while the allocations are in the sub-millisecond range for both MPI_Allocate and MPI_Win_allocate_shared, it takes >24s to allocate 100GB using MPI_Win_allocate, with the time increasing linearly with the allocation size.
>>>
>>> Are these issues known? Maybe there is documentation describing work-arounds (especially for 3) and 4))?
>>>
>>> I am attaching a small benchmark. Please make sure to adjust the MEM_PER_NODE macro to suit your system before you run it :) I'm happy to provide additional details if needed.
>>>
>>> Best
>>> Joseph
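The attached benchmark itself is not included in this digest; the following is only a rough sketch of the allocate/free cycle described above. MEM_PER_NODE and the overall structure are assumptions, and the original benchmark also exercised MPI_Allocate and MPI_Win_allocate, which are omitted here.

```
/* Rough sketch of a repeating allocate/touch/free cycle with
 * MPI_Win_allocate_shared (not the original attachment). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Adjust to the memory available per node before running. */
#define MEM_PER_NODE (100UL * 1024 * 1024 * 1024)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Node-local communicator: all ranks that can share memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Repeatedly allocate, touch, and free shared windows of growing size. */
    for (size_t size = 1UL << 30; size <= MEM_PER_NODE / node_size; size *= 2) {
        void   *base = NULL;
        MPI_Win  win;

        double t = MPI_Wtime();
        MPI_Win_allocate_shared((MPI_Aint)size, 1, MPI_INFO_NULL, node_comm,
                                &base, &win);
        memset(base, 0, size);   /* touch every page of the local segment */
        t = MPI_Wtime() - t;

        if (node_rank == 0) {
            printf("allocated and touched %zu B per process in %.3f s\n",
                   size, t);
        }
        MPI_Win_free(&win);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```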
>>>
>>> --
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> http://jeffhammond.github.io/
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users