Matt et al,

We were seeing an issue with VASP 5.2.8 compiled with Intel's 2011 SP1
6.233 compilers and Intel MPI 4.0.3 where Slurm was properly assigning
resources, yet mpirun wasn't scaling correctly; nodes were being
allocated but weren't being used at all.  These jobs also greedily
consumed all available QLogic contexts on the IB HCA.

We found that if a job was submitted with flags like these:

-N 4 --ntasks-per-node=4
mpirun -ppn $SLURM_NTASKS_PER_NODE -np $SLURM_NTASKS vasp

The job scaled properly and we saw the expected behavior, e.g. 4
processes on each of the 4 nodes.  We didn't need to use a hosts file,
and we also denied the software the ability to use all available free
contexts.
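For reference, here's a minimal sketch of a batch script using that
combination. The fallback values below just simulate what Slurm would
export for -N 4 --ntasks-per-node=4, and "vasp" stands in for your
actual binary path; the final command is echoed here for illustration,
where a real script would execute it:

```shell
#!/bin/bash
#SBATCH -N 4                  # four nodes
#SBATCH --ntasks-per-node=4   # four MPI ranks on each node

# Fall back to the values Slurm would export for the flags above,
# so the script can also be sanity-checked outside a job.
: "${SLURM_NTASKS_PER_NODE:=4}"
: "${SLURM_NTASKS:=16}"

# Tell Intel MPI's mpirun the layout explicitly; without -ppn it
# ignored some of the allocated nodes and hoarded QLogic contexts.
CMD="mpirun -ppn $SLURM_NTASKS_PER_NODE -np $SLURM_NTASKS vasp"
echo "$CMD"
```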

When we deliberately left off "-ppn" with mpirun, we saw the same
behavior as before, where nodes allocated to the job weren't even
being used.

This may or may not apply in this case, but I'd definitely try it out
and see what kind of results you get.

John DeSantis



2015-05-01 12:41 GMT-04:00 Will French <[email protected]>:
>
>
>
>>
>> If you use modules, perhaps you could detect when the module is loaded from 
>> a gateway and not set I_MPI_PMI_LIBRARY there. If you're not using SLURM on 
>> your gateways for interactive jobs then it becomes even easier-- just make 
>> I_MPI_PMI_LIBRARY conditional on finding one of the SLURM_* job variables.
>>
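A sketch of that conditional, for what it's worth (the library path
is illustrative, and set_pmi_library is just a name I'm making up for
the module/profile hook):

```shell
# Only point Intel MPI's PMI at Slurm's library when we're actually
# inside a Slurm allocation (i.e. Slurm's job variables are present).
set_pmi_library() {
    if [ -n "$SLURM_JOB_ID" ]; then
        export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so   # illustrative path
    fi
}
```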
>
>
> Yeah, I’ve thought about solutions like this but this also assumes that a 
> user will always use srun rather than Intel’s mpirun/mpiexec in a SLURM job. 
> I suppose that’s the best we can do. Thanks for the reply.
