Hi All,

I’m glad to see this come up. We’ve used OpenMPI for a long time and switched to SLURM (from torque+moab) about 2.5 years ago. At the time, I had a lot of questions about running MPI jobs under SLURM, and good information seemed to be scarce, especially regarding “srun”. I’ll just briefly share my/our observations. For those who are interested, there are examples of our suggested submission scripts at https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job (as I type this I’m hoping that page is up-to-date). Feel free to comment or make suggestions if you have had different experiences or know better (very possible).

1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).

2. We soon started to get user complaints of MPI apps running at 1/2 to 1/3 of their expected or previously observed speeds, but only sporadically: the same job, submitted the same way, would sometimes run at full speed and sometimes at (almost exactly) 1/2 or 1/3 speed. Investigation showed that some MPI ranks in the job were time-slicing across one or more of the cores allocated by SLURM. It turns out that if the slurm allocation is not consistent with the default OMPI core/socket mapping, this can easily happen. It can be avoided by a) using “srun --mpi=pmi2” or, as of OMPI 2.x, “srun --mpi=pmix”, or b) more carefully crafting your slurm resource request to be consistent with the OMPI default core/socket mapping. So beware of resource requests that specify only the number of tasks (--ntasks=64) and then launch with “mpiexec”. Slurm will happily allocate those tasks anywhere it can (on a busy cluster) and you will get some very non-optimal core mappings/bindings and, possibly, core sharing.

3. While doing some spank development for a local, per-job (not per-step) temporary directory, I noticed that when launching multi-host MPI jobs with mpiexec rather than srun, you end up with more than one host with “slurm_nodeid=1”. I’m not sure whether this is a bug (it was Slurm 15.08.x), and it didn’t seem to cause issues, but I also don’t think it is ideal for two nodes in the same job to have the same numeric nodeid. When launching with “srun”, that didn’t happen.

Anyway, that is what we have observed. Generally speaking, I try to get users to use “srun” but many of them still use “mpiexec” out of habit. You know what they say about old habits.

Comments, suggestions, or just other experiences are welcome. Also, if anyone is interested in the tmpdir spank plugin, you can contact me. We are happy to share.
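
For anyone who wants to check for the core sharing described in item 2, here is a rough sketch of one way to look at it; it is not the tooling we actually used. Each task reports the cores it is allowed to run on, and a small helper flags any core on a host that more than one rank may use. The function names are made up for illustration. SLURM_PROCID (set by srun) and OMPI_COMM_WORLD_RANK (set by Open MPI's mpiexec) are used to guess the rank, and os.sched_getaffinity is Linux-only.

```python
import os
import socket
from collections import defaultdict


def local_binding():
    """Return (host, rank, cpus) for the calling process (hypothetical helper)."""
    # SLURM_PROCID is set by srun, OMPI_COMM_WORLD_RANK by Open MPI's mpiexec.
    # Falling back to rank 0 when neither is set is an assumption of this sketch.
    rank = int(os.environ.get("SLURM_PROCID",
                              os.environ.get("OMPI_COMM_WORLD_RANK", "0")))
    # Linux-only call: the set of cores this process is allowed to run on.
    # On platforms without it, report an empty list instead of failing.
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = []
    return socket.gethostname(), rank, cpus


def find_shared_cores(bindings):
    """Map (host, core) -> ranks for every core that more than one rank may use."""
    users = defaultdict(list)
    for host, rank, cpus in bindings:
        for cpu in cpus:
            users[(host, cpu)].append(rank)
    return {key: sorted(ranks) for key, ranks in users.items() if len(ranks) > 1}


if __name__ == "__main__":
    # Print one line per task; launch this the same way as the MPI application.
    host, rank, cpus = local_binding()
    print(f"{host} rank={rank} cpus={','.join(map(str, cpus))}")
```

Launch one copy per task with the same launcher and resource request as the real job, collect the output lines, and feed them to find_shared_cores. An overlap only points to time-slicing when ranks are bound to individual cores; unbound ranks will all report the whole allocation, so overlap is expected there. Open MPI's mpiexec also has a --report-bindings option that prints similar information.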

Best and Merry Christmas to all,

Charlie Taylor
UF Research Computing

> On Dec 18, 2017, at 8:12 PM, r...@open-mpi.org wrote:
>
> We have had reports of applications running faster when executing under
> OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear,
> but are likely related to differences in mapping/binding options (OMPI
> provides a very large range compared to srun) and optimization flags provided
> by mpiexec that are specific to OMPI.
>
> OMPI uses PMIx for wireup support (starting with the v2.x series), which
> provides a faster startup than other PMI implementations. However, that is
> also available with Slurm starting with the 16.05 release, and some further
> PMIx-based launch optimizations were recently added to the Slurm 17.11
> release. So I would expect that launch via srun with the latest Slurm release
> and PMIx would be faster than mpiexec - though that still leaves the faster
> execution reports to consider.
>
> HTH
> Ralph
>
>
>> On Dec 18, 2017, at 2:18 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
>>
>> Greeting OpenMPI users and devs!
>>
>> We use OpenMPI with Slurm as our scheduler, and a user has asked me this:
>> should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?
>>
>> My inclination is to use mpiexec, since that is the only method that's
>> (somewhat) defined in the MPI standard and therefore the most portable, and
>> the examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation
>> on the schedmd website say to use srun with the --mpi=pmi option. (See links
>> below)
>>
>> What are the pros/cons of using these two methods, other than the
>> portability issue I already mentioned? Does srun+pmi use a different method
>> to wire up the connections? Some things I read online seem to indicate that.
>> If slurm was built with PMI support, and OpenMPI was built with Slurm
>> support, does it really make any difference?
>>
>> https://www.open-mpi.org/faq/?category=slurm
>>
>> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>>
>> --
>> Prentice
>>
>> _______________________________________________
>> users mailing list
>> email@example.com
>> https://lists.open-mpi.org/mailman/listinfo/users