Gilles,

On 09/04/2017 03:22 PM, Gilles Gouaillardet wrote:
Joseph,

Please open a GitHub issue regarding the SIGBUS error.

Done: https://github.com/open-mpi/ompi/issues/4166


As far as I understand, MAP_ANONYMOUS+MAP_SHARED can only be used
between related processes (e.g. a parent and its children).
In the case of Open MPI, MPI tasks are siblings, so this is not an option.


You are right, it doesn't work the way I expected. I should have tested it before suggesting it :)
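
(For the archives: the anonymous mapping is only inherited by children forked
after the mmap call; since MPI ranks are siblings started by the launcher,
there is no name or file descriptor they could use to map the same region.
A minimal illustration of the parent/child case:)

```
/* Sketch: anonymous shared memory is inherited across fork(),
 * but cannot be attached to by an unrelated (sibling) process. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 1 << 20; /* 1 MiB for illustration */
    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                /* child inherits the mapping */
        strcpy(buf, "hello from the child");
        return 0;
    }
    wait(NULL);
    printf("parent sees: %s\n", buf); /* works: parent and child are related */
    /* A sibling process has no way to map this region: with MAP_ANONYMOUS
     * there is no name or file descriptor to pass around. */
    munmap(buf, size);
    return 0;
}
```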

Best
Joseph

Cheers,

Gilles


On Mon, Sep 4, 2017 at 10:13 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our
cluster. My interest in this is of a more general nature, because I am
concerned that our users, who run Open MPI underneath our code, may run into
this issue on their machines.

I took a look at the code for the various window creation methods and now
have a better picture of the allocation process in Open MPI. I realized that
memory in windows allocated through MPI_Win_allocate or created through
MPI_Win_create is registered with the IB device using ibv_reg_mr, which
takes significant time for large allocations (I assume this is where
hugepages would help?). In contrast, memory attached through MPI_Win_attach
does not seem to be registered, which explains the lower allocation latency
I am observing there (I seem to remember having observed higher
communication latencies for such memory as well).
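
For reference, this is roughly the comparison I was timing (a simplified
sketch without error checking; N is the per-process allocation size):

```
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: time window creation for the two paths discussed above. */
void compare_paths(MPI_Aint N) {
    void *base;
    MPI_Win win;

    /* MPI_Win_allocate: the library allocates the memory itself and,
     * on IB, apparently registers it (ibv_reg_mr) up front, which is
     * where the time for large windows seems to go. */
    double t0 = MPI_Wtime();
    MPI_Win_allocate(N, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    double t_alloc = MPI_Wtime() - t0;
    MPI_Win_free(&win);

    /* Dynamic window + attach: the buffer comes from malloc and, as far
     * as I can tell, is not registered at attach time. */
    void *buf = malloc(N);
    t0 = MPI_Wtime();
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, N);
    double t_attach = MPI_Wtime() - t0;
    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);

    printf("MPI_Win_allocate: %.3f s, create_dynamic+attach: %.3f s\n",
           t_alloc, t_attach);
}
```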

Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix
component that uses shm_open to create a POSIX shared memory object
instead of a file on disk, which is then mmap'ed. Unfortunately, if I raise
the priority of this component above that of the default mmap component I
end up with a SIGBUS during MPI_Init. No other errors are reported by MPI.
Should I open a ticket on GitHub for this?
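
(For context, my understanding of what that component does conceptually is
something like the following; the helper is illustrative and not the actual
Open MPI code. The size limit would then be that of /dev/shm rather than of
/tmp.)

```
/* Conceptual sketch of POSIX shared memory, not the actual OMPI code. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *posix_shmem_segment(const char *name, size_t size) {
    /* name should start with '/', e.g. "/my_segment"; on Linux the
     * object lives in /dev/shm (a tmpfs), not in /tmp. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, size) != 0) { close(fd); return NULL; }
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    return ptr == MAP_FAILED ? NULL : ptr;
}
```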

As an alternative, would it be possible to use anonymous shared memory
mappings to avoid the backing file for large allocations (maybe above a
certain threshold) on systems that support MAP_ANONYMOUS, and to share the
resulting mapping among the processes on the node?

Thanks,
Joseph

On 08/29/2017 06:12 PM, Jeff Hammond wrote:

I don't know of any reason why you shouldn't be able to use IB for intra-node
transfers.  There are, of course, arguments against doing it in general
(e.g. IB/PCI bandwidth is lower than DDR4 bandwidth), but it likely behaves less
synchronously than shared memory, since I'm not aware of any MPI RMA library
that dispatches intra-node RMA operations to an asynchronous agent (e.g. a
communication helper thread).

Regarding 4, faulting 100GB in 24s corresponds to 1us per 4K page, which
doesn't sound unreasonable to me.  You might investigate if/how you can use
2M or 1G pages instead.  It's possible Open MPI already supports this, if
the underlying system does.  You may need to twiddle your OS settings to get
hugetlbfs working.
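
If you want to experiment outside of MPI first, a minimal sketch of explicit
huge-page allocation on Linux looks like the following (it assumes huge pages
have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages):

```
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* 8 GiB; must be a multiple of the huge page size (typically 2 MiB). */
    size_t size = 8UL << 30;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    /* Touch every huge page to measure fault time with 2M pages
     * instead of 4K pages. */
    for (size_t i = 0; i < size; i += 2UL << 20)
        ((char *)p)[i] = 0;
    munmap(p, size);
    return 0;
}
```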

Jeff

On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

     Jeff, all,

     Thanks for the clarification. My measurements show that global
     memory allocations do not require the backing file if there is only
     one process per node, for an arbitrary number of processes. So I was
     wondering if it was possible to use the same allocation process even
     with multiple processes per node if there is not enough space
     available in /tmp. However, I am not sure whether the IB devices can
     be used to perform intra-node RMA. At least that would retain the
     functionality on this kind of system (which arguably might be a rare
     case).

     On a different note, I found during the weekend that Valgrind only
     supports allocations up to 60GB, so my second point reported below
     may be invalid. Number 4 still seems curious to me, though.

     Best
     Joseph

     On 08/25/2017 09:17 PM, Jeff Hammond wrote:

          There's no reason to do anything special for shared memory with
          a single-process job because
          MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem().
          However, it would help debugging if MPI implementers at least
          had an option to take the code path that allocates shared memory
          even when np=1.
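
          (In code, the equivalence is roughly the following; a sketch that
          ignores error handling and assumes nbytes is the allocation size:)

          ```
          void *p1, *p2;
          MPI_Win win;

          MPI_Alloc_mem(nbytes, MPI_INFO_NULL, &p1);

          MPI_Win_allocate_shared(nbytes, 1, MPI_INFO_NULL,
                                  MPI_COMM_SELF, &p2, &win);
          /* With np=1 an implementation may treat both cases the same,
           * which is why an option to force the shared-memory
           * (backing-file) code path would help debugging. */
          ```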

         Jeff

          On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart
          <schuch...@hlrs.de> wrote:

              Gilles,

               Thanks for your swift response. On this system, /dev/shm
               only has 256M available, so that is unfortunately not an
               option. I tried disabling both the vader and sm btl via
               `--mca btl ^vader,sm` but Open MPI still seems to allocate
               the shmem backing file under /tmp. From my point of view,
               missing out on the performance benefits of file-backed
               shared memory would be acceptable as long as large
               allocations work, but I don't know the implementation
               details and whether that is possible. It seems that the
               mmap does not happen if there is only one process per node.

              Cheers,
              Joseph


              On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

                  Joseph,

                   The error message suggests that allocating memory with
                   MPI_Win_allocate[_shared] is done by creating a file
                   and then mmap'ing it.
                   How much space do you have in /dev/shm? (This is a
                   tmpfs, i.e., a RAM-backed file system.)
                   There is likely quite some space there, so as a
                   workaround I suggest you use it as the shared-memory
                   backing directory.

                   /* I am afk and do not remember the syntax; ompi_info
                   --all | grep backing is likely to help */
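
                   Something along these lines (please double-check the
                   exact parameter name with ompi_info; it may differ
                   between releases):

                   ```
                   mpirun --mca shmem_mmap_backing_file_base_dir /dev/shm \
                          -np 2 ./your_app
                   ```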

                  Cheers,

                  Gilles

                   On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart
                   <schuch...@hlrs.de> wrote:

                      All,

                       I have been experimenting with large window
                       allocations recently and have made some interesting
                       observations that I would like to share.

                       The system under test:
                           - Linux cluster equipped with IB
                           - Open MPI 2.1.1
                           - 128GB main memory per node
                           - 6GB /tmp filesystem per node

                       My observations:
                       1) Running with 1 process on a single node, I can
                       allocate and write to memory up to ~110 GB through
                       MPI_Alloc_mem, MPI_Win_allocate, and
                       MPI_Win_allocate_shared.

                       2) If running with 1 process per node on 2 nodes,
                       single large allocations succeed, but with the
                       repeating allocate/free cycle in the attached code I
                       see the application reproducibly being killed by the
                       OOM killer at a 25GB allocation with
                       MPI_Win_allocate_shared. When I try to run it under
                       Valgrind I get an error from MPI_Win_allocate at
                       ~50GB that I cannot make sense of:

                       ```
                       MPI_Alloc_mem:  53687091200 B
                       [n131302:11989] *** An error occurred in MPI_Alloc_mem
                       [n131302:11989] *** reported by process [1567293441,1]
                       [n131302:11989] *** on communicator MPI_COMM_WORLD
                       [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
                       [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                       [n131302:11989] ***    and potentially your MPI job)
                       ```

                       3) If running with 2 processes on a node, I get the
                       following error from both MPI_Win_allocate and
                       MPI_Win_allocate_shared:
                       ```
                       --------------------------------------------------------------------------
                       It appears as if there is not enough space for
                       /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
                       (the shared-memory backing file). It is likely that
                       your MPI job will now either abort or experience
                       performance degradation.

                           Local host:  n131702
                           Space Requested: 6710890760 B
                           Space Available: 6433673216 B
                       ```
                       This seems to be related to the size limit of /tmp.
                       MPI_Alloc_mem works as expected, i.e., I can allocate
                       ~50GB per process. I understand that I can set $TMP
                       to a bigger filesystem (such as Lustre) but then I am
                       greeted with a warning on each allocation and
                       performance seems to drop. Is there a way to fall
                       back to the allocation strategy used in case 2)?

                       4) It is also worth noting the time it takes to
                       allocate the memory: while the allocations are in the
                       sub-millisecond range for both MPI_Alloc_mem and
                       MPI_Win_allocate_shared, it takes >24s to allocate
                       100GB using MPI_Win_allocate, with the time
                       increasing linearly with the allocation size.

                       Are these issues known? Is there perhaps
                       documentation describing work-arounds (esp. for 3
                       and 4)?

                       I am attaching a small benchmark. Please make sure
                       to adjust the MEM_PER_NODE macro to suit your system
                       before you run it :) I'm happy to provide additional
                       details if needed.
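
                       (For concreteness, the allocate/free cycle I am
                       timing looks roughly like this; it is a simplified
                       sketch, not the attached file. shared_comm is a
                       node-local communicator, and nproc_per_node, rank,
                       and MEM_PER_NODE are assumed to be set up elsewhere:)

                       ```
                       for (size_t size = 1UL << 30; size <= MEM_PER_NODE;
                            size *= 2) {
                           void *base;
                           MPI_Win win;
                           double t = MPI_Wtime();
                           MPI_Win_allocate_shared(size / nproc_per_node, 1,
                               MPI_INFO_NULL, shared_comm, &base, &win);
                           t = MPI_Wtime() - t;
                           if (rank == 0) printf("%zu B: %f s\n", size, t);
                           MPI_Win_free(&win);
                           /* equivalent loops for MPI_Alloc_mem/MPI_Free_mem
                            * and MPI_Win_allocate */
                       }
                       ```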

                      Best
                      Joseph
                       --
                       Dipl.-Inf. Joseph Schuchart
                       High Performance Computing Center Stuttgart (HLRS)
                       Nobelstr. 19
                       D-70569 Stuttgart

                       Tel.: +49(0)711-68565890
                       Fax: +49(0)711-6856832
                       E-Mail: schuch...@hlrs.de








          --
          Jeff Hammond
          jeff.scie...@gmail.com
          http://jeffhammond.github.io/









--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/








--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
