Joseph,

Thanks for sharing this!

sysv is imho the worst option: if something goes really wrong, Open MPI 
might leave shared memory segments behind when a job crashes. From that 
perspective, leaving a big file in /tmp can be seen as the lesser evil.
That being said, there might be other reasons that drove this design.
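
For illustration, here is a minimal sketch (my own, not Open MPI code) of why
sysv segments can be left behind: a segment created with shmget() lives in the
kernel until someone explicitly marks it for removal, so a crash before the
cleanup call leaks it.

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* create a 1 MiB System V segment */
    int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }
    void *p = shmat(id, NULL, 0);            /* attach it */
    if (p == (void *)-1) { perror("shmat"); return 1; }

    /* ... if the job crashes here, the segment stays allocated ... */

    shmdt(p);                                /* detach */
    shmctl(id, IPC_RMID, NULL);              /* explicit removal is required */
    return 0;
}
```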

Cheers,

Gilles

Joseph Schuchart <schuch...@hlrs.de> wrote:
>We are currently discussing internally how to proceed with this issue on 
>our machine. We did a little survey of the setup of some of the 
>machines we have access to, which include an IBM system, a Bull machine, and 
>two Cray XC40 machines. To summarize our findings:
>
>1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and 
>each limited to half of the main memory size per node.
>2) On the IBM system, nodes have 64 GB of memory, and /tmp is limited to 20 GB 
>and mounted from a disk partition. /dev/shm, on the other hand, is sized at 
>63 GB.
>3) On the above systems, /proc/sys/kernel/shm* is set up to allow the 
>full memory of the node to be used as System V shared memory.
>4) On the Bull machine, /tmp is mounted from disk and fixed at ~100 GB, 
>while /dev/shm is limited to half the node's memory (there are nodes 
>with 2 TB of memory; huge page support is available). System V shared memory, 
>on the other hand, is limited to 4 GB.
>
>Overall, it seems that there is no globally optimal allocation strategy 
>as the best matching source of shared memory is machine dependent.
>
>Open MPI treats System V shared memory as the least favorable option, 
>even giving it a lower priority than POSIX shared memory, where 
>conflicting names might occur. What is the reason for preferring /tmp and 
>POSIX shared memory over System V? It seems to me that System V is a 
>cleaner and safer way (provided that shared memory is not constrained by 
>/proc, which could easily be detected), while mmap'ing large files feels 
>somewhat hacky. Maybe I am missing an important aspect here, though.
>
>The reason I am interested in this issue is that our PGAS library is 
>built on top of MPI and allocates pretty much all memory exposed to the 
>user through MPI windows. Thus, any limitation in the underlying MPI 
>implementation (or the system, for that matter) limits the amount of usable 
>memory for our users.
>
>Given our observations above, I would like to propose a change to the 
>shared memory allocator: the priorities would be derived from the 
>percentage of main memory each component can cover, i.e.,
>
>Priority = 99*(min(Memory, SpaceAvail) / Memory)
>
>At startup, each shm component would determine the available size (by 
>looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*, respectively) and 
>set its priority between 0 and 99. A user could force Open MPI to use a 
>specific component by manually setting its priority to 100 (which of 
>course would have to be documented). The priority could factor in other 
>aspects as well, such as whether /tmp is actually tmpfs or disk-based, if 
>that makes a difference in performance.
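>
>As a rough sketch of that computation (my own, nothing that exists in Open
>MPI today; the function name is made up, and statvfs() only covers the
>file-system-backed components, while the sysv component would consult
>/proc/sys/kernel/shm* instead):
>
>```c
>#include <stdint.h>
>#include <sys/statvfs.h>
>
>/* 0..99 depending on how much of the node's memory the mount can cover;
> * 100 stays reserved for an explicit user override */
>static int shm_component_priority(const char *mount, uint64_t node_mem)
>{
>    struct statvfs vfs;
>    if (statvfs(mount, &vfs) != 0)
>        return 0;                              /* cannot query: lowest priority */
>    uint64_t avail = (uint64_t)vfs.f_bavail * vfs.f_frsize;
>    if (avail > node_mem)
>        avail = node_mem;                      /* cap at main memory size */
>    return (int)((99 * avail) / node_mem);
>}
>```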
>
>This proposal of course assumes that shared memory size is the sole 
>optimization goal. Maybe there are other aspects to consider? I'd be 
>happy to work on a patch but would like to get some feedback before 
>getting my hands dirty. IMO, the current situation is less than ideal 
>and prone to cause pain for the average user. In my recent experience, 
>debugging this has been tedious, and users in general shouldn't have 
>to care about how shared memory is allocated (and administrators don't 
>always seem to care, see above).
>
>Any feedback is highly appreciated.
>
>Joseph
>
>
>On 09/04/2017 03:13 PM, Joseph Schuchart wrote:
>> Jeff, all,
>> 
>> Unfortunately, I (as a user) have no control over the page size on our 
>> cluster. My interest in this is more of a general nature because I am 
>> concerned that our users who use Open MPI underneath our code might run 
>> into this issue on their machines.
>> 
>> I took a look at the code for the various window creation methods and 
>> now have a better picture of the allocation process in Open MPI. I 
>> realized that memory in windows allocated through MPI_Win_allocate or 
>> created through MPI_Win_create is registered with the IB device using 
>> ibv_reg_mr, which takes significant time for large allocations (I assume 
>> this is where huge pages would help?). In contrast, it seems that 
>> memory attached through MPI_Win_attach is not registered, which explains 
>> the lower latency I am observing for these allocations (I seem to 
>> remember having observed higher communication latencies as well).
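>>
>> To make the two paths concrete, here is a minimal sketch (mine, not taken
>> from Open MPI or the thread) of the calls being compared:
>>
>> ```c
>> #include <mpi.h>
>>
>> int main(int argc, char **argv) {
>>     MPI_Init(&argc, &argv);
>>     MPI_Aint size = (MPI_Aint)1 << 30;   /* 1 GiB */
>>     void *base;
>>
>>     /* Path 1: MPI_Win_allocate -- on IB this is where the memory is
>>      * registered (ibv_reg_mr), hence the cost that grows with size. */
>>     MPI_Win win_alloc;
>>     MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win_alloc);
>>     MPI_Win_free(&win_alloc);
>>
>>     /* Path 2: dynamic window + attach -- apparently no up-front registration. */
>>     MPI_Win win_dyn;
>>     MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win_dyn);
>>     MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
>>     MPI_Win_attach(win_dyn, base, size);
>>     MPI_Win_detach(win_dyn, base);
>>     MPI_Free_mem(base);
>>     MPI_Win_free(&win_dyn);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>> ```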
>> 
>> Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix 
>> component that uses shm_open to create a POSIX shared memory object 
>> instead of a file on disk, which is then mmap'ed. Unfortunately, if I 
>> raise the priority of this component above that of the default mmap 
>> component, I end up with a SIGBUS during MPI_Init. No other errors are 
>> reported by MPI. Should I open a ticket on GitHub for this?
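>>
>> For reference, a hedged sketch of what the POSIX path presumably boils
>> down to (shm_open puts the object in /dev/shm instead of a file in /tmp);
>> the helper name is made up. A SIGBUS on first access typically means the
>> tmpfs behind the mapping ran out of space, which might be worth checking
>> here:
>>
>> ```c
>> #include <fcntl.h>
>> #include <sys/mman.h>
>> #include <sys/stat.h>
>> #include <unistd.h>
>>
>> void *posix_shm_alloc(const char *name, size_t size) {
>>     int fd = shm_open(name, O_CREAT | O_RDWR, 0600);  /* object in /dev/shm */
>>     if (fd < 0) return NULL;
>>     if (ftruncate(fd, (off_t)size) != 0) {
>>         close(fd); shm_unlink(name); return NULL;
>>     }
>>     void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>     close(fd);                                         /* mapping stays valid */
>>     return (p == MAP_FAILED) ? NULL : p;
>> }
>> ```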
>> 
>> As an alternative, would it be possible to use anonymous shared memory 
>> mappings to avoid the backing file for large allocations (maybe above a 
>> certain threshold) on systems that support MAP_ANONYMOUS, and to 
>> distribute the result of the mmap call among the processes on the node?
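>>
>> A minimal sketch of what such an anonymous shared mapping looks like (my
>> illustration, not something Open MPI does today); note the catch that the
>> mapping is only shared with children created by fork(), which is exactly
>> why distributing it among independently launched ranks is the hard part:
>>
>> ```c
>> #include <sys/mman.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>>
>> int main(void) {
>>     size_t size = (size_t)1 << 30;                      /* 1 GiB, no backing file */
>>     char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>     if (buf == MAP_FAILED) return 1;
>>     if (fork() == 0) {                                  /* child shares the pages */
>>         buf[0] = 42;
>>         _exit(0);
>>     }
>>     wait(NULL);                                         /* now buf[0] == 42 in the parent */
>>     munmap(buf, size);
>>     return 0;
>> }
>> ```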
>> 
>> Thanks,
>> Joseph
>> 
>> On 08/29/2017 06:12 PM, Jeff Hammond wrote:
>>> I don't know of any reason why you shouldn't be able to use IB for 
>>> intra-node transfers.  There are, of course, arguments against doing 
>>> it in general (e.g. IB/PCI bandwidth is lower than DDR4 bandwidth), but it 
>>> likely behaves less synchronously than shared memory, since I'm not 
>>> aware of any MPI RMA library that dispatches intra-node RMA 
>>> operations to an asynchronous agent (e.g. a communication helper thread).
>>>
>>> Regarding 4, faulting 100 GB in 24 s corresponds to about 1 us per 4 KB 
>>> page, which doesn't sound unreasonable to me.  You might investigate 
>>> if/how you can use 2 MB or 1 GB pages instead.  It's possible Open MPI 
>>> already supports this, if the underlying system does.  You may need to 
>>> twiddle your OS settings to get hugetlbfs working.
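>>>
>>> A hedged example of explicit huge pages on Linux (my sketch, not what
>>> Open MPI does internally); it needs huge pages reserved beforehand, e.g.
>>> via vm.nr_hugepages:
>>>
>>> ```c
>>> #define _GNU_SOURCE
>>> #include <stdio.h>
>>> #include <sys/mman.h>
>>>
>>> int main(void) {
>>>     size_t size = (size_t)512 << 21;   /* 512 x 2 MiB huge pages = 1 GiB */
>>>     void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>>>     if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
>>>     munmap(p, size);
>>>     return 0;
>>> }
>>> ```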
>>>
>>> Jeff
>>>
>>> On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>>     Jeff, all,
>>>
>>>     Thanks for the clarification. My measurements show that global
>>>     memory allocations do not require the backing file if there is only
>>>     one process per node, for an arbitrary number of processes. So I was
>>>     wondering whether it would be possible to use the same allocation
>>>     process even with multiple processes per node if there is not enough
>>>     space available in /tmp. However, I am not sure whether the IB devices
>>>     can be used to perform intra-node RMA. At least that would retain the
>>>     functionality on this kind of system (which arguably might be a rare
>>>     case).
>>>
>>>     On a different note, I found over the weekend that Valgrind only
>>>     supports allocations up to 60 GB, so my second point reported below
>>>     may be invalid. Number 4 still seems curious to me, though.
>>>
>>>     Best
>>>     Joseph
>>>
>>>     On 08/25/2017 09:17 PM, Jeff Hammond wrote:
>>>
>>>         There's no reason to do anything special for shared memory with
>>>         a single-process job because
>>>         MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). 
>>>         However, it would help debugging if MPI implementers at least
>>>         had an option to take the code path that allocates shared memory
>>>         even when np=1.
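>>>
>>>         A quick sketch of that equivalence for np=1 (my example, not
>>>         from the thread):
>>>
>>>         ```c
>>>         #include <mpi.h>
>>>
>>>         int main(int argc, char **argv) {
>>>             MPI_Init(&argc, &argv);
>>>             void *a, *b;
>>>             MPI_Win win;
>>>             /* shared-memory window on a self communicator ... */
>>>             MPI_Win_allocate_shared(1 << 20, 1, MPI_INFO_NULL,
>>>                                     MPI_COMM_SELF, &a, &win);
>>>             /* ... behaves essentially like a plain allocation */
>>>             MPI_Alloc_mem(1 << 20, MPI_INFO_NULL, &b);
>>>             MPI_Free_mem(b);
>>>             MPI_Win_free(&win);
>>>             MPI_Finalize();
>>>             return 0;
>>>         }
>>>         ```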
>>>
>>>         Jeff
>>>
>>>         On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart
>>>         <schuch...@hlrs.de> wrote:
>>>
>>>              Gilles,
>>>
>>>              Thanks for your swift response. On this system, /dev/shm
>>>              only has 256M available, so that is no option, unfortunately.
>>>              I tried disabling both the vader and sm btl via `--mca btl
>>>              ^vader,sm`, but Open MPI still seems to allocate the shmem
>>>              backing file under /tmp. From my point of view, missing out
>>>              on the performance benefits of file-backed shared memory
>>>              would be acceptable as long as large allocations work, but I
>>>              don't know the implementation details and whether that is
>>>              possible. It seems that the mmap does not happen if there is
>>>              only one process per node.
>>>
>>>              Cheers,
>>>              Joseph
>>>
>>>
>>>              On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>>>
>>>                  Joseph,
>>>
>>>                  the error message suggests that allocating memory with
>>>                  MPI_Win_allocate[_shared] is done by creating a file
>>>                  and then mmap'ing it.
>>>                  How much space do you have in /dev/shm? (This is a tmpfs,
>>>                  i.e. a RAM-backed file system.)
>>>                  There is likely quite some space there, so as a
>>>                  workaround I suggest you use this as the shared-memory
>>>                  backing directory.
>>>
>>>                  /* I am afk and do not remember the syntax; ompi_info
>>>                  --all | grep backing is likely to help */
>>>
>>>                  Cheers,
>>>
>>>                  Gilles
>>>
>>>                  On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart
>>>                  <schuch...@hlrs.de> wrote:
>>>
>>>                      All,
>>>
>>>                      I have been experimenting with large window
>>>                      allocations recently and have made some interesting
>>>                      observations that I would like to share.
>>>
>>>                      The system under test:
>>>                          - Linux cluster equipped with IB
>>>                          - Open MPI 2.1.1
>>>                          - 128 GB main memory per node
>>>                          - 6 GB /tmp filesystem per node
>>>
>>>                      My observations:
>>>                      1) Running with 1 process on a single node, I can
>>>                      allocate and write to memory up to ~110 GB through
>>>                      MPI_Alloc_mem, MPI_Win_allocate, and
>>>                      MPI_Win_allocate_shared.
>>>
>>>                      2) If running with 1 process per node on 2 nodes,
>>>                      single large allocations succeed, but with the
>>>                      repeating allocate/free cycle in the attached code I
>>>                      see the application reproducibly being killed by the
>>>                      OOM killer at a 25 GB allocation with
>>>                      MPI_Win_allocate_shared. When I try to run it under
>>>                      Valgrind, I get an error from MPI_Win_allocate at
>>>                      ~50 GB that I cannot make sense of:
>>>
>>>                      ```
>>>                      MPI_Alloc_mem:  53687091200 B
>>>                      [n131302:11989] *** An error occurred in 
>>> MPI_Alloc_mem
>>>                      [n131302:11989] *** reported by process 
>>> [1567293441,1]
>>>                      [n131302:11989] *** on communicator MPI_COMM_WORLD
>>>                      [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>>                      [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes
>>>         in this
>>>                      communicator
>>>                      will now abort,
>>>                      [n131302:11989] ***    and potentially your MPI job)
>>>                      ```
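>>>
>>>                      For reference, the allocate/free cycle is essentially
>>>                      the following kind of loop; this is only a sketch of
>>>                      the attached benchmark, assuming <mpi.h>, <string.h>,
>>>                      an initialized MPI environment, and that MEM_PER_NODE
>>>                      stands for the per-node limit used in the benchmark:
>>>
>>>                      ```c
>>>                      /* repeatedly allocate, touch, and free a shared
>>>                       * window of growing size until something fails */
>>>                      for (MPI_Aint sz = (MPI_Aint)1 << 30; sz <= MEM_PER_NODE;
>>>                           sz += (MPI_Aint)1 << 30) {
>>>                          void *base;
>>>                          MPI_Win win;
>>>                          MPI_Win_allocate_shared(sz, 1, MPI_INFO_NULL,
>>>                                                  MPI_COMM_WORLD, &base, &win);
>>>                          memset(base, 0, sz);   /* touch the memory */
>>>                          MPI_Win_free(&win);
>>>                      }
>>>                      ```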
>>>
>>>                      3) If running with 2 processes on a node, I get the
>>>                      following error from
>>>                      both MPI_Win_allocate and MPI_Win_allocate_shared:
>>>                      ```
>>>         
>>> -------------------------------------------------------------------------- 
>>>
>>>                      It appears as if there is not enough space for
>>>         
>>> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>>                      (the
>>>                      shared-memory backing
>>>                      file). It is likely that your MPI job will now
>>>         either abort
>>>                      or experience
>>>                      performance degradation.
>>>
>>>                          Local host:  n131702
>>>                          Space Requested: 6710890760 B
>>>                          Space Available: 6433673216 B
>>>                      ```
>>>                      This seems to be related to the size limit of /tmp.
>>>                      MPI_Alloc_mem works as expected, i.e., I can allocate
>>>                      ~50 GB per process. I understand that I can set $TMP
>>>                      to a bigger filesystem (such as Lustre), but then I am
>>>                      greeted with a warning on each allocation and
>>>                      performance seems to drop. Is there a way to fall back
>>>                      to the allocation strategy used in case 2)?
>>>
>>>                      4) It is also worth noting the time it takes to
>>>                      allocate the memory: while the allocations are in the
>>>                      sub-millisecond range for both MPI_Alloc_mem and
>>>                      MPI_Win_allocate_shared, it takes >24 s to allocate
>>>                      100 GB using MPI_Win_allocate, and the time increases
>>>                      linearly with the allocation size.
>>>
>>>                      Are these issues known? Maybe there is
>>>         documentation describing
>>>                      work-arounds? (esp. for 3) and 4))
>>>
>>>                      I am attaching a small benchmark. Please make sure
>>>         to adjust the
>>>                      MEM_PER_NODE macro to suit your system before you
>>>         run it :)
>>>                      I'm happy to
>>>                      provide additional details if needed.
>>>
>>>                      Best
>>>                      Joseph
>>>                      --
>>>                      Dipl.-Inf. Joseph Schuchart
>>>                      High Performance Computing Center Stuttgart (HLRS)
>>>                      Nobelstr. 19
>>>                      D-70569 Stuttgart
>>>
>>>                      Tel.: +49(0)711-68565890
>>>                      Fax: +49(0)711-6856832
>>>                      E-Mail: schuch...@hlrs.de
>>>
>>>
>>>
>>>
>>>
>>>              --
>>>              Dipl.-Inf. Joseph Schuchart
>>>              High Performance Computing Center Stuttgart (HLRS)
>>>              Nobelstr. 19
>>>              D-70569 Stuttgart
>>>
>>>              Tel.: +49(0)711-68565890
>>>              Fax: +49(0)711-6856832
>>>              E-Mail: schuch...@hlrs.de
>>>
>>>
>>>
>>>
>>>         --
>>>         Jeff Hammond
>>>         jeff.scie...@gmail.com
>>>         http://jeffhammond.github.io/
>>>
>>>
>>>
>>>
>>>
>>>     --
>>>     Dipl.-Inf. Joseph Schuchart
>>>     High Performance Computing Center Stuttgart (HLRS)
>>>     Nobelstr. 19
>>>     D-70569 Stuttgart
>>>
>>>     Tel.: +49(0)711-68565890
>>>     Fax: +49(0)711-6856832
>>>     E-Mail: schuch...@hlrs.de
>>>
>>>
>>>
>>>
>>> -- 
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> http://jeffhammond.github.io/
>>>
>>>
>> 
>> 
>
>
>-- 
>Dipl.-Inf. Joseph Schuchart
>High Performance Computing Center Stuttgart (HLRS)
>Nobelstr. 19
>D-70569 Stuttgart
>
>Tel.: +49(0)711-68565890
>Fax: +49(0)711-6856832
>E-Mail: schuch...@hlrs.de
