Dear all,

We are hitting the following error when running a simple Open MPI “Hello World” 
on a single node with UCX 1.16 or newer, together with Open MPI 5.0.x or some 
4.1.x versions from 4.1.5 onwards:

rcache.c:248  UCX  ERROR   mmap(size=151552) failed: Cannot allocate memory
pgtable.c:75   Fatal: Failed to allocate page table directory
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)

This is on CentOS 8.10, kernel 4.18, 192 GB RAM, Intel Xeon Gold 6138 
(dual-socket Skylake, 40 cores). The failure reproduces only above a threshold 
of roughly 20-24 MPI ranks; runs with fewer than 20 ranks work fine. Older UCX 
versions on the same system (e.g. 1.12) do not show this issue.

The issue also goes away if we run Open MPI with the ob1 PML (without UCX), or 
if we keep the UCX PML but disable some of its transports, e.g. with 
UCX_TLS=^shm or UCX_TLS=^ib.
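
For reference, the configurations compared above correspond roughly to the 
launch lines produced by the small Python sketch below. It only prints the 
commands and runs nothing. The binary name ./hello_world and the rank count of 
40 are placeholders chosen for illustration, not our exact invocation.

```python
import shlex


def mpirun_cmd(ranks, pml, ucx_tls=None, binary="./hello_world"):
    """Build an Open MPI launch line as an argv list (nothing is executed).

    'binary' is a placeholder name for the Hello World executable.
    """
    cmd = ["mpirun", "-np", str(ranks), "--mca", "pml", pml]
    if ucx_tls is not None:
        # mpirun's -x option exports an environment variable to all ranks.
        cmd += ["-x", f"UCX_TLS={ucx_tls}"]
    cmd.append(binary)
    return cmd


# Placeholder rank count above the 20-24 threshold described in this mail.
RANKS = 40

CASES = {
    "UCX PML, default transports (fails for us)": mpirun_cmd(RANKS, "ucx"),
    "ob1 PML, no UCX (works)": mpirun_cmd(RANKS, "ob1"),
    "UCX PML without shm (works)": mpirun_cmd(RANKS, "ucx", "^shm"),
    "UCX PML without ib (works)": mpirun_cmd(RANKS, "ucx", "^ib"),
}

if __name__ == "__main__":
    for label, cmd in CASES.items():
        # shlex.join quotes the '^' so the line can be pasted into a shell.
        print(f"{label}:")
        print(f"  {shlex.join(cmd)}")
```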

Has anyone seen similar "mmap failed / Failed to allocate page table directory" 
errors with UCX 1.16 or newer and Open MPI 4.1.x/5.0.x, or does anyone know of 
regressions or configuration pitfalls (e.g. rcache, huge pages, memtype cache, 
or other UCX/Open MPI memory-related settings)? Are there specific UCX 
environment variables or OMPI MCA parameters you would recommend trying to 
diagnose this further?

I can provide full ompi_info, ucx_info, build options, and more complete logs 
if that is helpful.


Many thanks in advance for any hints or suggestions.


Best regards,
Christoph Niethammer

--

Dr.-Ing. Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: [email protected]
https://www.hlrs.de/people/christoph-niethammer
