Christoph,

As you indicated that disabling UCX makes this issue go away, it seems
the memory exhaustion arises from UCX. I have limited knowledge of the
UCX internals; when I need to change its behavior I use `ucx_info -c`
and then dig into the output. For a better answer I would suggest asking
on the UCX GitHub.
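
For example, something along these lines on the failing node (the procfs
paths are standard Linux; the grep filter terms are only illustrative
starting points, not an exhaustive list):

```shell
# Kernel limits discussed in this thread:
cat /proc/sys/vm/max_map_count      # per-process limit on mmap entries
cat /proc/sys/vm/overcommit_memory  # 0/1 allow overcommit, 2 = strict

# Dump all UCX tunables and keep the ones that look memory-related
# (filter terms are illustrative; skip silently if ucx_info is absent):
command -v ucx_info >/dev/null && ucx_info -c | grep -Ei 'fifo|rcache|alloc' || true
```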

Best,
  George.


On Fri, Mar 6, 2026 at 5:16 AM Christoph Niethammer <[email protected]>
wrote:

> Hello George,
>
> thanks for the suggestion.
>
> On our system vm.max_map_count is currently 64k.
>
> I should also mention that memory overcommit is disabled
> (vm.overcommit_memory = 2).
> If we change this setting to allow memory overcommit (vm.overcommit_memory
> = 0 or 1), the issue disappears.
>
> However, it still looks somewhat surprising that a simple “Hello World”
> application with ~20 MPI processes
> already triggers this behaviour. Given that the system has 192 GB of RAM,
> it does not seem obvious why startup
> would fail due to memory allocation at such a small scale.
>
> Ideally we would prefer to keep memory overcommit disabled, since it helps
> detect memory issues in user applications
> early rather than failing later at runtime.
>
> Is there a way to influence this memory-exhausting behaviour with some
> settings in Open MPI (or UCX components)?
> So far we also experimented with adjusting the UCX FIFO sizes, since the
> defaults changed in newer UCX releases.
> In particular we tried restoring the older values used in UCX 1.15:
> UCX_POSIX_FIFO_SIZE=64, UCX_SYSV_FIFO_SIZE=64, UCX_XPMEM_FIFO_SIZE=64.
> Unfortunately this did not resolve the issue.
>
> Are there other UCX parameters (e.g. related to shared-memory transports,
> rcache behaviour, or memtype cache) or
> Open MPI MCA parameters that could reduce the number of memory mappings or
> the amount of virtual memory reserved
> during startup?
>
> Any suggestions for further debugging or configuration options to try
> would be highly appreciated.
>
> Best regards,
> Christoph
>
> ----- Original Message -----
> From: "Open MPI Users" <[email protected]>
> To: "Open MPI Users" <[email protected]>
> Sent: Wednesday, 4 March, 2026 17:19:00
> Subject: Re: [OMPI users] "Cannot allocate memory” / pgtable failure with
> Open MPI and UCX 1.16 or newer
>
> It looks like some form of resource exhaustion, possibly exceeding the
> number of entries in the mmap table. What is the value of
> `vm.max_map_count` on this system? You can obtain it with `sysctl
> vm.max_map_count` or `cat /proc/sys/vm/max_map_count`.
>
>   George
>
>
> On Wed, Mar 4, 2026 at 4:25 AM Christoph Niethammer <[email protected]>
> wrote:
>
> > Dear all,
> >
> > We are hitting the following error when running a simple Open MPI “Hello
> > World” with UCX 1.16 or newer and Open MPI 5.0.x and some 4.1.5+ versions
> > on a single node:
> >
> > rcache.c:248   UCX  ERROR  mmap(size=151552) failed: Cannot allocate memory
> > pgtable.c:75   Fatal: Failed to allocate page table directory
> > *** Process received signal ***
> > Signal: Aborted (6)
> > Signal code:  (-6)
> >
> > This is on CentOS 8.10, kernel 4.18, 192 GB RAM, Intel Xeon Gold 6138
> > (dual-socket Skylake, 40 cores). The failure is reproducible only when
> > using more than 20-24 MPI ranks; fewer than 20 ranks work fine. Older UCX
> > versions on the same system (e.g. 1.12) do not show this issue.
> >
> > The issue also goes away if we run Open MPI with the ob1 PML (without
> > UCX), or if we disable some of the UCX PML transports with UCX_TLS=^shm
> > or UCX_TLS=^ib.
> >
> > Has anyone seen similar "mmap failed / Failed to allocate page table
> > directory" errors with UCX > 1.15 and Open MPI 4.1.x/5.0.x, or is anyone
> > aware of known regressions or configuration pitfalls (e.g. rcache, huge
> > pages, memtype cache, or other UCX/Open MPI memory-related settings)?
> > Are there specific UCX environment variables or OMPI MCA parameters you
> > would recommend trying to diagnose this further?
> >
> > I can provide full ompi_info, ucx_info, build options, and more complete
> > logs if that is helpful.
> >
> >
> > Many thanks in advance for any hints or suggestions.
> >
> >
> > Best regards,
> > Christoph Niethammer
> >
> > --
> >
> > Dr.-Ing. Christoph Niethammer
> > High Performance Computing Center Stuttgart (HLRS)
> > Nobelstrasse 19
> > 70569 Stuttgart
> >
> > Tel: ++49(0)711-685-87203
> > email: [email protected]
> > https://www.hlrs.de/people/christoph-niethammer
> >
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> >
> >
>
