Hi Seb,

I'm glad you caught the "make contribs" thing. Now that you mention it, that has caught me in the past.

As for the inter-core communication, I suggest asking for help on the mvapich2 list. PMI2 should only be involved in job setup, not while the job is running. It appears that one transport is being invoked when you use mpirun, and a different one when you use srun.

Andy

On 09/30/2017 09:55 AM, Sebastian Eastham wrote:

Hi Andy,

I see your point regarding the age of our slurm build. On that basis, I decided to try your suggestion, and built a new copy of slurm (17.02.11). Interestingly, mvapich2 failed to build for the same reason – it still could not find pmi2.h. However, since this was a local slurm installation, I had greater license to try something that I saw elsewhere on the slurm discussion board, and decided to go to the contribs/pmi2 directory in the slurm source and run “make” followed by “make install”. This appears to have partially resolved the problem, at least in the sense that MVAPICH2 no longer chokes at that point, successfully finding the pmi2.h file. On this basis I went back to the server’s antiquated slurm build and performed the same process – going to the original 14.11 source directory, entering contribs/pmi2, and running make followed by make install. This appears to have added the relevant PMI2 files to slurm without needing a restart, and MVAPICH2 now builds with the “--with-pmi=pmi2” option!
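
For anyone repeating the contribs/pmi2 step described above, it can be sketched roughly as follows. This is a minimal sketch rather than an official procedure: build_pmi2 is a hypothetical helper name, the path in the usage comment is a placeholder, and it assumes the Slurm source tree has already been configured (so that contribs/pmi2 contains a Makefile) and that you can write to the install prefix.

```shell
# Hypothetical helper: build and install only the PMI2 pieces
# from an already-configured Slurm source tree.
build_pmi2() {
    src="$1"
    # Refuse to run if the tree does not contain contribs/pmi2.
    if [ ! -d "$src/contribs/pmi2" ]; then
        echo "build_pmi2: no contribs/pmi2 directory under '$src'" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$src/contribs/pmi2" && make && make install )
}

# Usage (placeholder path, adjust to your own source directory):
#   build_pmi2 /path/to/slurm-source
```

The directory check is only there so that a mistyped path fails early instead of running make in the wrong place.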

Unfortunately, it looks like this has all been for nought, as I still get the same poor performance with MVAPICH2 when running with slurm, with much better performance when running “bare” (i.e. using mpirun on a disconnected node). The issue appears to be in the inter-core communication. Calculations on independent cores are just as fast when using slurm or MVAPICH2’s mpirun, but work involving inter-core communication is an order of magnitude slower when using srun than when using mpirun. I had hoped that PMI-2 would help, but it appears not. While I know this is a rather broad question, if you have any ideas as to what might cause this kind of slowdown I would greatly appreciate them! Hopefully this is some minor setup flaw that I’ve introduced, but if there was some change in a more recent version of slurm that could explain this behavior then I will bite the bullet and upgrade to slurm v17 whenever we have a quiet weekend on the cluster.
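
To make the srun versus mpirun comparison above repeatable, a small timing wrapper may help. This is only a sketch: time_launch is a hypothetical helper, the task count of 2 is arbitrary, and ./my_binary stands in for whatever communication-heavy test program is being run. It measures whole-second wall-clock time only, which should be enough to see an order-of-magnitude difference.

```shell
# Hypothetical helper: run a command, discard its output, and report
# elapsed wall-clock seconds together with the command's exit status.
time_launch() {
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    echo "$((end - start))s (exit $status): $*"
    return "$status"
}

# Usage (assumed task count and binary name):
#   time_launch srun -n 2 --mpi=pmi2 ./my_binary
#   time_launch mpirun -np 2 ./my_binary
```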

Regards,

Seb

*From:* Andy Riebs [mailto:andy.ri...@hpe.com]
*Sent:* Friday, September 29, 2017 12:54 PM
*To:* slurm-dev <slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: slurm with PMI2

Hi Seb,

Is there any chance that Slurm 14.11 put pmi2.h somewhere else, perhaps dropping the "slurm/" and putting it straight into, say, /usr/include/pmi2.h?
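
One way to check where (or whether) the PMI2 files were installed is to search the likely prefixes directly. The following is a minimal sketch: find_pmi2 is a hypothetical helper name, and the prefixes in the usage comment are only guesses based on the paths mentioned in this thread.

```shell
# Hypothetical helper: list any PMI2 header or library files under a prefix.
find_pmi2() {
    prefix="$1"
    # Match the header (pmi2.h) and any libpmi2.* library files;
    # errors from unreadable directories are silenced.
    find "$prefix" \( -name 'pmi2.h' -o -name 'libpmi2*' \) -print 2>/dev/null
}

# Usage (assumed prefixes):
#   find_pmi2 /opt/slurm
#   find_pmi2 /usr/include
```

No output for a given prefix means neither the header nor the library was found there.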

I hate to say it, but Slurm 14.11 is nearly 3 years old, and there is a new major release of Slurm every 6 months or so. I've been using 16.05 and 17.11.0-0pre2 with that recipe for mvapich2, and no pmi2-specific build options for Slurm, with success.

You might try building a private instance of a current version of Slurm (probably 17.02.x), and then see if you can build mvapich2 against that. This would save the disruption of actually replacing Slurm on your system until you had some reason to think it would help. And if it works, go for it!

Andy

On 09/29/2017 12:40 PM, Sebastian Eastham wrote:

    Thanks Andy! Do you happen to know if you took any special steps
    when building slurm to ensure that pmi2 support was present? At
    this end, my MVAPICH2 configure (when I try to use pmi2) is

    ./configure --with-pm=slurm --prefix=${installDir} \
        --with-slurm=/opt/slurm --with-pmi=pmi2

    However, as long as the “--with-pmi=pmi2” option is included, it
    fails with “could not find slurm/pmi2.h”. Unfortunately
    “--enable-slurm=yes” did not resolve the issue; the same is true
    for directly specifying the slurm include and library paths
    (--with-slurm-include and --with-slurm-lib). My suspicion is that
    our installation of slurm currently lacks the necessary PMI2
    header file and libraries, but I am not clear on how to install
    these. If there is any way to install them without reinstalling
    slurm then that would be ideal, but if a reinstall is necessary
    then that too can be scheduled.

    Regards,

    Seb

    *From:* Andy Riebs [mailto:andy.ri...@hpe.com]
    *Sent:* Friday, September 29, 2017 12:30 PM
    *To:* slurm-dev <slurm-dev@schedmd.com>
    *Subject:* [slurm-dev] Re: slurm with PMI2

    FWIW, we include these options when we build mvapich2:

            --with-pmi=pmi2 \
            --with-pm=slurm \
            --with-slurm=/opt/slurm \
            --enable-slurm=yes

    It feels like there is some redundancy there, but it works!

    Andy

    On 09/29/2017 12:12 PM, Sebastian Eastham wrote:

        Dear Slurm Developers mailing list,

        I was hoping for a quick clarification regarding PMI2 support
        in slurm. We are running slurm v14.11.5, and installing
        MVAPICH2 v2.3b, which is listed as preferring PMI2. However,
        we found upon trying to configure MVAPICH2 that we could not
        use the flag “--with-pmi=pmi2”, as this resulted in the error
        “could not find slurm/pmi2.h”. As a result, we built MVAPICH2
        without this flag. Strangely, we can successfully run MVAPICH2
        MPI jobs with slurm, using srun -n $SLURM_NTASKS --mpi=pmi2
        ./my_binary, but code which requires communication between
        cores is running extremely slowly. My assumption is that this
        is because PMI2 is not actually being used, in spite of the
        --mpi=pmi2 flag for srun.

        My questions are:

         1. I notice that “srun --mpi=list” shows pmi2 as an option,
            and that I can launch srun with the flag --mpi=pmi2.
            However, given that our installed slurm/lib directory does
            not contain a libpmi2.so file (only libpmi.*/libslurm*.*
            etc), and I cannot find pmi2.h installed anywhere, am I
            right in thinking that our as-installed version of slurm
            does not support PMI2? Some of the archived mailing list
            posts seem to support this conclusion, but I was not sure.
         2. If indeed we do not have pmi2 support, what is the
            procedure to upgrade our build to include pmi2 support?
            Can this be done without fully reinstalling slurm? If so,
            how? If not, what additional steps will need to be taken
            on the reinstall to ensure that slurm has the required
            pmi2 support?

        I appreciate any help or guidance that you can give me!

        Regards,


        Seb

        =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

        Dr. Sebastian D. Eastham

        Research Scientist

        Laboratory for Aviation and the Environment

        Massachusetts Institute of Technology

        =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

