I've set up several clusters over the years with OpenMPI, and I often get the
error below:

   WARNING: It appears that your OpenFabrics subsystem is configured to only
   allow registering part of your physical memory.  This can cause MPI jobs to
   run with erratic performance, hang, and/or crash.
   ...
   http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

     Local host:              c2-31
     Registerable memory:     32768 MiB
     Total memory:            64398 MiB

I'm well aware of the normal fixes, and have implemented them in Puppet to
ensure compute nodes get the changes.  To be paranoid I've implemented all of
them, and they all worked under Ubuntu 13.10.

However, with Ubuntu 14.04 they no longer seem to work, hence the above
message.

As recommended by the FAQs, I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf
   * hard memlock unlimited
   * soft memlock unlimited
   * hard stack unlimited
   * soft stack unlimited
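
Since limits.conf is applied through PAM sessions, a daemon started at boot may
not pick it up.  A minimal sketch for checking the limit a given process
actually has, by reading /proc/<pid>/limits; the helper name memlock_of is
hypothetical, and using pidof to find slurmd is an assumption about the node:

```shell
# Print the locked-memory limit of a process from /proc/<pid>/limits.
# With no argument, inspect the reading process itself, which inherits
# its limits from the current shell.
memlock_of() {
    grep 'Max locked memory' "/proc/${1:-self}/limits"
}

memlock_of                        # limit inherited from the current shell
# memlock_of "$(pidof slurmd)"    # assumption: run on a compute node, may need root
```

If the daemon's line shows a finite value while a login shell shows unlimited,
the limit is being set somewhere other than the shell profile.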

My changes seem to be working.  If I submit this to Slurm:
#!/bin/bash -l
ulimit -l
hostname
mpirun bash -c ulimit -l
mpirun ./relay 1 131072

I get:
   unlimited
   c2-31
   unlimited
   unlimited
   unlimited
   unlimited
   <above error message, reporting only 32 GB of registerable memory>
   <output of mpirun relay>
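
One caveat about the batch script above: in "bash -c ulimit -l" the command
string is just "ulimit", and "-l" becomes $0 of that shell.  With no option,
bash's ulimit reports the file-size limit (-f), so the "unlimited" lines from
mpirun may not be the locked-memory limit at all.  A small sketch of the
difference, run without mpirun; the variable names are only for illustration:

```shell
# Unquoted: bash runs the command string "ulimit"; "-l" is assigned to $0.
unquoted=$(bash -c ulimit -l)

# Quoted: the whole string "ulimit -l" is the command.
quoted=$(bash -c 'ulimit -l')

# For comparison: the file-size limit, which plain "ulimit" defaults to.
file_size=$(bash -c 'ulimit -f')

echo "unquoted (really ulimit -f): $unquoted"
echo "quoted   (ulimit -l):        $quoted"
echo "file size limit:             $file_size"
```

In the batch script, the quoted form would be mpirun bash -c 'ulimit -l'.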

Is there some new kernel parameter, OFED parameter, or similar that controls
locked pages now?  The kernel is 3.13.0-36 and the libopenmpi-dev package is
1.6.5.

Since the ulimit -l setting appears to reach both the Slurm-launched script and
the mpirun-launched binaries, I'm pretty puzzled.

Any suggestions?
