Hello,
I have run into a strange problem when running an Open MPI job over OpenIB
on an SGI ICE cluster with 4096 cores (or more), and the FAQ does not help.
The OMPI version is 1.4.1, and it runs fine with a smaller number of
cores (up to 512).
The error message is the following:
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: r25i1n0
OMPI source: btl_openib.c:169
Function: ibv_create_cq()
Device: mlx4_0
Memlock limit: unlimited
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
This is a fairly common message, covered in the FAQ, but you have probably
noticed the 'unlimited' value for the memlock limit, which should not cause
any trouble. So, what is wrong here?
The ompi_info follows:
I'm starting the application like this:
mpiexec -mca btl openib,sm,self -mca mpi_leave_pinned 1 \
    -mca orte_tmpdir_base /home/grodid/pbs.776824.service0.x8z/tmp My_App
I checked that the memlock limit is indeed fine on the nodes with this
command:
mpiexec -mca btl openib,sm,self -mca mpi_leave_pinned 1 \
    -mca orte_tmpdir_base ${PBS_JOBDIR}/tmp /usr/bin/tcsh -c limit
which provides this output:
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse 31457280 kbytes
vmemoryuse unlimited
descriptors 16384
memorylocked unlimited
maxproc 303104
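For a per-host view of the same check, a small script along these lines could be launched instead of `tcsh -c limit`. This is only a sketch: the function name `report_memlock` is made up for illustration, and it assumes bash is available on the compute nodes.

```shell
#!/bin/bash
# Sketch: report the locked-memory (memlock) limits seen by this process.
# In bash, "ulimit -l" prints the limit in kbytes, or "unlimited".
report_memlock() {
    local soft hard
    soft=$(ulimit -S -l)   # soft limit, the one actually enforced
    hard=$(ulimit -H -l)   # hard limit, the ceiling for the soft limit
    echo "$(uname -n): memlock soft=${soft} hard=${hard}"
}

report_memlock
```

Saved as an executable script and passed to mpiexec in place of `/usr/bin/tcsh -c limit`, it should print one line per process, which makes it easier to spot a node whose limits differ from the others.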
The OS of the compute nodes is:
cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
What is wrong, then?
Any help is welcome. Regards, Gilbert.