Hi Seb,
I'm glad you caught the "make contribs" thing. Now that you mention it,
that has caught me in the past.
As for the inter-core communication, I suggest asking for help on the
mvapich2 list. PMI2 should only be involved in job setup, not while the
job is running. It appears that one transport is being invoked when you
use mpirun, and a different one when you use srun.
Andy
On 09/30/2017 09:55 AM, Sebastian Eastham wrote:
Hi Andy,
I see your point regarding the age of our slurm build. On that basis,
I decided to try your suggestion, and built a new copy of slurm
(17.02.11). Interestingly, mvapich2 failed to build for the same
reason – it still could not find pmi2.h. However, since this was a
local slurm installation, I had greater license to try something that
I saw elsewhere on the slurm discussion board, and decided to go to
the contribs/pmi2 directory in the slurm source and run “make”
followed by “make install”. This appears to have partially resolved
the problem, at least in the sense that MVAPICH2 no longer chokes at
that point, successfully finding the pmi2.h file. On this basis I went
back to the server’s antiquated slurm build and performed the same
process – going to the original 14.11 source directory, entering
contribs/pmi2, and running make followed by make install. This appears
to have added the relevant PMI2 files to slurm without needing a
restart, and MVAPICH2 now builds with the “--with-pmi=pmi2” option!
Unfortunately, it looks like this has all been for nought, as I still
get the same poor performance with MVAPICH2 when running with slurm,
with much better performance when running “bare” (i.e. using mpirun on
a disconnected node). The issue appears to be in the inter-core
communication. Calculations on independent cores are equally fast under
slurm's srun and MVAPICH2’s mpirun, but work involving inter-core
communication is an order of magnitude slower when using srun than
when using mpirun. I had hoped that PMI-2 would help, but it appears
not. While I know this is a rather broad question, if you have any
ideas as to what might cause this kind of slowdown I would greatly
appreciate them! Hopefully this is some minor setup flaw that I’ve
introduced, but if there was some change in a more recent version of
slurm that could explain this behavior then I will bite the bullet and
upgrade to slurm v17 whenever we have a quiet weekend on the cluster.
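For anyone retracing the contribs/pmi2 step described earlier in this message, a minimal sketch follows. It assumes the slurm source tree has already been configured (so that contribs/pmi2 contains a generated Makefile) and that you have write access to the install prefix. The function name install_pmi2_contrib and the example path in the usage note are made up for illustration.

```shell
#!/bin/sh
# Build and install only the PMI2 client library from an existing slurm
# source tree, leaving the rest of the installation untouched.
# install_pmi2_contrib is a hypothetical helper name, not part of slurm.
install_pmi2_contrib() {
    src="$1"
    # Refuse to continue if the expected subdirectory is absent
    if [ ! -d "$src/contribs/pmi2" ]; then
        echo "no contribs/pmi2 directory under '$src'" >&2
        return 1
    fi
    # Assumption: the tree was configured earlier, so a Makefile exists here.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$src/contribs/pmi2" && make && make install )
}
```

Usage would look something like `install_pmi2_contrib /path/to/slurm-14.11.5`, where the path is a placeholder for wherever the original source was unpacked.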
Regards,
Seb
*From:* Andy Riebs [mailto:andy.ri...@hpe.com]
*Sent:* Friday, September 29, 2017 12:54 PM
*To:* slurm-dev <slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: slurm with PMI2
Hi Seb,
Is there any chance that Slurm 14.11 put pmi2.h somewhere else,
perhaps dropping the "slurm/" and putting it straight into, say,
/usr/include/pmi2.h?
I hate to say it, but Slurm 14.11 is nearly 3 years old, and there is
a new major release of Slurm every 6 months or so. I've been using
16.05 and 17.11.0-0pre2 with that recipe for mvapich2, and no
pmi2-specific build options for Slurm, with success.
You might try building a private instance of a current version of
Slurm (probably 17.02.x), and then see if you can build mvapich2
against that. This would save the disruption of actually replacing
Slurm on your system until you had some reason to think it would help.
And if it works, go for it!
Andy
On 09/29/2017 12:40 PM, Sebastian Eastham wrote:
Thanks Andy! Do you happen to know if you took any special steps
when building slurm to ensure that pmi2 support was present? At
this end, my MVAPICH2 configure (when I try to use pmi2) is
./configure --with-pm=slurm --prefix=${installDir} \
    --with-slurm=/opt/slurm --with-pmi=pmi2
However, as long as the “--with-pmi=pmi2” option is included, it
fails with “could not find slurm/pmi2.h”. Unfortunately
“--enable-slurm=yes” did not resolve the issue; the same is true
for directly specifying the slurm include and library paths
(--with-slurm-include and --with-slurm-lib). My suspicion is that
our installation of slurm currently lacks the necessary PMI2
header file and libraries, but I am not clear on how to install
these. If there is any way to install them without reinstalling
slurm then that would be ideal, but if a reinstall is necessary
then that too can be scheduled.
Regards,
Seb
*From:* Andy Riebs [mailto:andy.ri...@hpe.com]
*Sent:* Friday, September 29, 2017 12:30 PM
*To:* slurm-dev <slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: slurm with PMI2
FWIW, we include these options when we build mvapich2:
--with-pmi=pmi2 \
--with-pm=slurm \
--with-slurm=/opt/slurm \
--enable-slurm=yes
It feels like there is some redundancy there, but it works!
Andy
On 09/29/2017 12:12 PM, Sebastian Eastham wrote:
Dear Slurm Developers mailing list,
I was hoping for a quick clarification regarding PMI2 support
in slurm. We are running slurm v14.11.5, and installing
MVAPICH2 v2.3b, which is listed as preferring PMI2. However,
we found upon trying to configure MVAPICH2 that we could not
use the flag “--with-pmi=pmi2”, as this resulted in the error
“could not find slurm/pmi2.h”. As a result, we built MVAPICH2
without this flag. Strangely, we can successfully run MVAPICH2
MPI jobs with slurm, using srun -n $SLURM_NTASKS --mpi=pmi2
./my_binary, but code which requires communication between
cores is running extremely slowly. My assumption is that this
is because PMI2 is not actually being used, in spite of the
--mpi=pmi2 flag for srun.
My questions are:
1. I notice that “srun --mpi=list” shows pmi2 as an option,
and that I can launch srun with the flag --mpi=pmi2.
However, given that our installed slurm/lib directory does
not contain a libpmi2.so file (only libpmi.*/libslurm*.*
etc), and I cannot find pmi2.h installed anywhere, am I
right in thinking that our as-installed version of slurm
does not support PMI2? Some of the archived mailing list
posts seem to support this conclusion, but I was not sure.
2. If indeed we do not have pmi2 support, what is the
procedure to upgrade our build to include pmi2 support?
Can this be done without fully reinstalling slurm? If so,
how? If not, what additional steps will need to be taken
on the reinstall to ensure that slurm has the required
pmi2 support?
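The check described in question 1 can be sketched as a small script. This is only an illustration: it assumes the header lives at include/slurm/pmi2.h and the library under lib/ relative to the install prefix (matching the "slurm/pmi2.h" that MVAPICH2's configure reports and the slurm/lib directory mentioned above). The function name check_pmi2 and the SLURM_PREFIX variable are made up for this sketch, and the /opt/slurm default is the path used elsewhere in this thread.

```shell
#!/bin/sh
# Report whether a slurm install prefix contains the PMI2 header and library.
# check_pmi2 is a hypothetical helper name; it returns 0 only if both exist.
check_pmi2() {
    prefix="${1:-/opt/slurm}"
    rc=0
    # MVAPICH2's configure reported that it could not find slurm/pmi2.h
    if [ -f "$prefix/include/slurm/pmi2.h" ]; then
        echo "header: found"
    else
        echo "header: missing"
        rc=1
    fi
    # Match libpmi2.so as well as versioned names such as libpmi2.so.0
    if ls "$prefix"/lib/libpmi2.so* >/dev/null 2>&1; then
        echo "library: found"
    else
        echo "library: missing"
        rc=1
    fi
    return $rc
}

# Example invocation; SLURM_PREFIX is an assumed override, not a slurm variable
check_pmi2 "${SLURM_PREFIX:-/opt/slurm}" || true
```

Some installations place libraries under lib64 instead of lib, in which case the library path in the sketch would need adjusting.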
I appreciate any help or guidance that you can give me!
Regards,
Seb
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr. Sebastian D. Eastham
Research Scientist
Laboratory for Aviation and the Environment
Massachusetts Institute of Technology
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=