> On Dec 19, 2017, at 8:46 AM, Charles A Taylor <chas...@ufl.edu> wrote:
> 
> Hi All,
> 
> I’m glad to see this come up.  We’ve used OpenMPI for a long time and 
> switched to SLURM (from torque+moab) about 2.5 years ago.  At the time, I had 
> a lot of questions about running MPI jobs under SLURM and good information 
> seemed to be scarce - especially regarding “srun”.   I’ll just briefly share 
> my/our observations.  For those who are interested, there are examples of our 
> suggested submission scripts at 
> https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job (as I type this 
> I’m hoping that page is up-to-date).  Feel free to comment or make 
> suggestions if you have had different experiences or know better (very 
> possible).
> 
> 1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).
> 
> 2. We soon started getting user complaints of MPI apps running at 1/2 to 1/3 
> of their expected or previously observed speeds - but only sporadically, 
> meaning that the same job, submitted the same way, would sometimes run at full 
> speed and sometimes at almost exactly 1/2 or 1/3 speed.
> 
> Investigation showed that some MPI ranks in the job were time-slicing across 
> one or more of the cores allocated by SLURM.  It turns out that if the slurm 
> allocation is not consistent with the default OMPI core/socket mapping, this 
> can easily happen.  It can be avoided by (a) launching with "srun --mpi=pmi2" 
> (or, with Open MPI 2.x and later, "srun --mpi=pmix"), or (b) crafting your 
> Slurm resource request more carefully so that it is consistent with the OMPI 
> default core/socket mapping.
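> For illustration, a minimal sbatch sketch that combines (a) and (b); the node 
> and core counts, module names, and application name are placeholders for your 
> own site:
> 
>   #!/bin/bash
>   #SBATCH --nodes=2              # whole nodes, so the allocation matches OMPI's default mapping
>   #SBATCH --ntasks-per-node=32   # one rank per core on a (hypothetical) 32-core node
>   #SBATCH --cpus-per-task=1
> 
>   module load gcc openmpi        # placeholder module names
>   srun --mpi=pmix ./my_mpi_app   # use --mpi=pmi2 with pre-2.x Open MPI
> 
> Requesting whole nodes (or at least fixing --ntasks-per-node) keeps the 
> allocation consistent with the default core/socket mapping, and launching 
> through srun lets Slurm do the placement and binding in the first place.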

Or you could tell OMPI to do what you actually want using the --map-by and 
--bind-to options, perhaps putting them in the default MCA param file.
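For example (just a sketch; the right policies depend on your hardware and 
workload), in the system openmpi-mca-params.conf or in 
$HOME/.openmpi/mca-params.conf:

  # map one rank per core and bind each rank to its core
  rmaps_base_mapping_policy = core
  hwloc_base_binding_policy = core

or, equivalently, on the command line ("./my_mpi_app" is a placeholder):

  mpiexec --map-by core --bind-to core ./my_mpi_app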

Or you could enable cgroups in Slurm so that OMPI sees the binding envelope - 
OMPI will respect it. The problem is that OMPI isn't seeing the requested 
binding envelope and thinks resources are available that really aren't, so it 
gets confused about how to map things. Slurm expresses that envelope in an 
environment variable, but the name and syntax keep changing across releases, 
and we just can't track it all the time.
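If you go the cgroup route, the relevant configuration is roughly this (a 
sketch - check the cgroup documentation for your Slurm version):

  # slurm.conf
  TaskPlugin=task/cgroup

  # cgroup.conf
  ConstrainCores=yes

With cores constrained by the cgroup, OMPI only sees the CPUs Slurm actually 
granted and will map and bind within that envelope.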

However, I agree that it can be a problem if Slurm is allocating resources in a 
non-HPC manner (i.e., not colocating allocations to maximize performance) and 
you just want to use the default mpiexec options. We only see that when someone 
configures Slurm to share nodes among users instead of allocating whole nodes 
to single users, which is not the normal HPC mode of operation.

So if you are going to configure slurm to operate in the “cloud” mode of 
allocating individual processor assets, then yes - probably better to use srun 
instead of the default mpiexec options, or add some directives to the default 
MCA param file.

> 
> So beware of resource requests that specify only the number of tasks 
> (--ntasks=64) and then launch with "mpiexec".  On a busy cluster, Slurm will 
> happily allocate those tasks anywhere it can, and you will get some very 
> non-optimal core mappings/bindings and, possibly, core sharing.
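> For instance, a bare "--ntasks=64" can land ranks on whatever partially idle 
> cores Slurm finds across many nodes; something like 
> "--nodes=2 --ntasks-per-node=32" (counts are illustrative) keeps the ranks 
> packed and matches what mpiexec's default mapping expects.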
> 
> 3. While doing some SPANK plugin development for a local, per-job (not 
> per-step) temporary directory, I noticed that when launching multi-host MPI 
> jobs with mpiexec (vs. srun), you end up with more than one host having 
> "slurm_nodeid=1".  I'm not sure whether this is a bug (it was 15.08.x), and it 
> didn't seem to cause issues, but I don't think it is ideal for two nodes in 
> the same job to have the same numeric nodeid.   When launching with "srun", 
> that didn't happen.

I’m not sure what “slurm_nodeid” is - where does this come from?

> 
> Anyway, that is what we have observed.  Generally speaking, I try to get 
> users to use “srun” but many of them still use “mpiexec” out of habit.  You 
> know what they say about old habits.  

Again, it truly depends on how things are configured, if the users are using 
scripts that need to port to other environments, etc.

> 
> Comments, suggestions, or just other experiences are welcome.  Also, if 
> anyone is interested in the tmpdir spank plugin, you can contact me.  We are 
> happy to share.
> 
> Best and Merry Christmas to all,
> 
> Charlie Taylor
> UF Research Computing
> 
> 
> 
>> On Dec 18, 2017, at 8:12 PM, r...@open-mpi.org wrote:
>> 
>> We have had reports of applications running faster when executing under 
>> OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, 
>> but are likely related to differences in mapping/binding options (OMPI 
>> provides a very large range compared to srun) and optimization flags 
>> provided by mpiexec that are specific to OMPI.
>> 
>> OMPI uses PMIx for wireup support (starting with the v2.x series), which 
>> provides a faster startup than other PMI implementations. However, that is 
>> also available with Slurm starting with the 16.05 release, and some further 
>> PMIx-based launch optimizations were recently added to the Slurm 17.11 
>> release. So I would expect that launch via srun with the latest Slurm 
>> release and PMIx would be faster than mpiexec - though that still leaves the 
>> faster execution reports to consider.
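>> To check what your installations actually support (output will vary by 
>> build), something like:
>> 
>>   srun --mpi=list            # PMI plugin types your Slurm build provides
>>   ompi_info | grep -i pmix   # whether your Open MPI was built with PMIx
>> 
>> If "pmix" shows up in both, "srun --mpi=pmix" should give you the faster 
>> PMIx-based wireup described above.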
>> 
>> HTH
>> Ralph
>> 
>> 
>>> On Dec 18, 2017, at 2:18 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
>>> 
>>> Greeting OpenMPI users and devs!
>>> 
>>> We use OpenMPI with Slurm as our scheduler, and a user has asked me this: 
>>> should they use mpiexec/mpirun or srun to start their MPI jobs through 
>>> Slurm?
>>> 
>>> My inclination is to use mpiexec, since that is the only method that's 
>>> (somewhat) defined in the MPI standard and therefore the most portable, and 
>>> the examples in the OpenMPI FAQ use mpirun. However, the Slurm 
>>> documentation on the SchedMD website says to use srun with the --mpi=pmi 
>>> option. (See links below.)
>>> 
>>> What are the pros/cons of using these two methods, other than the 
>>> portability issue I already mentioned? Does srun+pmi use a different method 
>>> to wire up the connections? Some things I read online seem to indicate 
>>> that. If slurm was built with PMI support, and OpenMPI was built with Slurm 
>>> support, does it really make any difference?
>>> 
>>> https://www.open-mpi.org/faq/?category=slurm
>>> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>>>  
>>> 
>>> 
>>> -- 
>>> Prentice
>>> 
>> 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
