Christoph,

As you indicated that disabling UCX makes this issue go away, it seems the
memory exhaustion arises from UCX. I have limited knowledge of the UCX
internals; when I need to change its behavior I use `ucx_info -c` and then
dig into the output. For a better answer I would suggest asking on the UCX
GitHub.
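As a concrete starting point, the `ucx_info -c` dump can be filtered for the shared-memory and registration-cache settings discussed later in the thread. A minimal sketch; the grep pattern is only a guess at which variables matter here, and the fallback branches cover machines where UCX is not installed or nothing matches:

```shell
#!/bin/sh
# Dump UCX's effective configuration and keep only the variables that
# plausibly relate to this issue: FIFO sizes, rcache, memtype cache, shm.
# The pattern is a heuristic, not an exhaustive list.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep -Ei 'fifo|rcache|memtype|shm' \
        || echo "no matching UCX variables found"
else
    echo "ucx_info not found in PATH"
fi
```

Comparing this output between UCX 1.12 and 1.16 on the same node should show which defaults changed between the working and failing versions.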
Best,
George.

On Fri, Mar 6, 2026 at 5:16 AM Christoph Niethammer <[email protected]> wrote:
> Hello George,
>
> thanks for the suggestion.
>
> On our system vm.max_map_count is currently 64k.
>
> I should also mention that memory overcommit is disabled
> (vm.overcommit_memory = 2).
> If we change this setting to allow memory overcommit
> (vm.overcommit_memory = 0 or 1), the issue disappears.
>
> However, it still looks somewhat surprising that a simple “Hello World”
> application with ~20 MPI processes already triggers this behaviour.
> Given that the system has 192 GB of RAM, it does not seem obvious why
> startup would fail due to memory allocation at such a small scale.
>
> Ideally we would prefer to keep memory overcommit disabled, since it
> helps detect memory issues in user applications early rather than
> failing later at runtime.
>
> Is there a way to influence this memory-exhausting behaviour with some
> settings in Open MPI (or UCX components)?
> So far we have also experimented with adjusting the UCX FIFO sizes,
> since the defaults changed in newer UCX releases. In particular we tried
> restoring the older values used in UCX 1.15:
> UCX_POSIX_FIFO_SIZE=64, UCX_SYSV_FIFO_SIZE=64, UCX_XPMEM_FIFO_SIZE=64.
> Unfortunately this did not resolve the issue.
>
> Are there other UCX parameters (e.g. related to shared-memory
> transports, rcache behaviour, or memtype cache) or Open MPI MCA
> parameters that could reduce the number of memory mappings or the
> amount of virtual memory reserved during startup?
>
> Any suggestions for further debugging or configuration options to try
> would be highly appreciated.
>
> Best regards,
> Christoph
>
> ----- Original Message -----
> From: "Open MPI Users" <[email protected]>
> To: "Open MPI Users" <[email protected]>
> Sent: Wednesday, 4 March, 2026 17:19:00
> Subject: Re: [OMPI users] "Cannot allocate memory" / pgtable failure
> with Open MPI and UCX 1.16 or newer
>
> It looks like some form of resource exhaustion, possibly exceeding the
> number of entries in the mmap table. What is the value of
> `vm.max_map_count` on this system? You can obtain it with
> `sysctl vm.max_map_count` or `cat /proc/sys/vm/max_map_count`.
>
> George
>
>
> On Wed, Mar 4, 2026 at 4:25 AM Christoph Niethammer <[email protected]> wrote:
>
> > Dear all,
> >
> > We are hitting the following error when running a simple Open MPI
> > "Hello World" with UCX 1.16 or newer and Open MPI 5.0.x and some
> > 4.1.5+ versions on a single node:
> >
> >   rcache.c:248  UCX ERROR mmap(size=151552) failed: Cannot allocate memory
> >   pgtable.c:75  Fatal: Failed to allocate page table directory
> >   *** Process received signal ***
> >   Signal: Aborted (6)
> >   Signal code: (-6)
> >
> > This is on CentOS 8.10, kernel 4.18, 192 GB RAM, Intel Xeon Gold 6138
> > (dual-socket Skylake, 40 cores). The failure is reproducible only when
> > using more than 20-24 MPI ranks; fewer than 20 ranks work fine. Older
> > UCX versions on the same system (e.g. 1.12) do not show this issue.
> >
> > The issue also goes away if we run Open MPI with the ob1 PML (without
> > UCX) or if we disable some of the UCX PML transports with UCX_TLS=^shm
> > or UCX_TLS=^ib.
> >
> > Has anyone seen similar "mmap failed / Failed to allocate page table
> > directory" errors with UCX > 1.15 and Open MPI 4.1.x/5.0.x, or is
> > anyone aware of known regressions or configuration pitfalls (e.g.
> > rcache, huge pages, memtype cache, or other UCX/Open MPI
> > memory-related settings)? Are there specific UCX environment variables
> > or OMPI MCA parameters you would recommend trying to diagnose this
> > further?
> >
> > I can provide full ompi_info, ucx_info, build options, and more
> > complete logs if that is helpful.
> >
> > Many thanks in advance for any hints or suggestions.
> >
> > Best regards,
> > Christoph Niethammer
> >
> > --
> > Dr.-Ing. Christoph Niethammer
> > High Performance Computing Center Stuttgart (HLRS)
> > Nobelstrasse 19
> > 70569 Stuttgart
> >
> > Tel: ++49(0)711-685-87203
> > email: [email protected]
> > https://www.hlrs.de/people/christoph-niethammer
>
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected].
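To follow up on the vm.max_map_count suggestion above: one quick way to see how close a process comes to the limit is to compare its number of memory mappings (the line count of its /proc/<pid>/maps) against the kernel limit. A minimal Linux-only sketch that inspects the current shell; for an MPI run you would point it at each rank's PID instead:

```shell
#!/bin/sh
# Compare this process's number of memory mappings with the kernel's
# per-process limit (Linux only; both values come from /proc).
maps=$(wc -l < /proc/self/maps)
limit=$(cat /proc/sys/vm/max_map_count)
echo "mappings=$maps limit=$limit"
```

If each rank attaches shared-memory segments for every peer on the node, mappings per rank would grow roughly linearly with the rank count (and the node-wide total quadratically), which could fit the observed ~20-rank threshold; watching this counter while scaling up the rank count would confirm or rule that out.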
