Yes. We’re in the process of migrating to SLURM (hence my other question on the MVAPICH2 list). Torque is our primary right now.
To my knowledge, the only difference is Torque/Maui vs. SLURM. It is the same set of environment variables, same MVAPICH2 stack, same CUDA version, and the same actual binary tree in both runs. We are actually running the jobs using the mpiexec that is part of MVAPICH2 in both cases; I think in the SLURM world, you’d usually use srun instead, but we’ve not gotten that working properly yet (could be because our copy of MVAPICH2 was built to use hydra, not SLURM, for the PMI?). I’ll give it a shot in any case.

> On Feb 10, 2015, at 4:42 PM, Jonathan Perkins <[email protected]> wrote:
>
> Do you have both environments available to do this comparison? If so,
> is SLURM vs. Torque the only difference?
>
> I do think that it'll be good to provide the output of the MPI job with
> those two variables that I mentioned in the earlier post. Maybe it will
> show a difference in affinity. Otherwise it can be something else at
> play.
>
> Between your two jobs with SLURM, did you only flip the Task
> Affinity setting? It seems that affinity in MVAPICH2 was enabled in
> both runs so I would expect the second run to not perform so badly.
>
> On Sat, Feb 07, 2015 at 10:48:31AM -0800, Novosielski, Ryan wrote:
>> So I turned off TaskAffinity (=none) and we ran two CUDA/GPU jobs on one
>> node. Apparently the performance with PBS/Torque is good and with Slurm it
>> is not. I'm confused as to why it would make any difference:
>>
>> Running 1 MPI job with slurm
>> Gpu utilization 74-99%
>> Cpu - 4 CPU cores, 0-3
>> 64-84% utilization
>>
>> Speed: 21ns/day vs previously reported 25.6ns/day with PBS
>>
>> Submitting the second MPI job
>> Gpu utilization down to 9-13% (slightly better, it was 1-3% before [with
>> TaskAffinity enabled])
>> Cpu - 4 CPU cores, the same 0-3
>> 99% utilization
>>
>> Speed: slow.. Okay, finally got it 1.45ns/day
>>
>> Would that variable still be helpful to try?
>>
>> We're using Slurm 14.11.3, MVAPICH2 2.0, Intel Compiler 15.0.1, and AMBER 14
>> for these performance numbers. GPU's are M2090's I think. I'd have to check
>> that he wasn't using the K20's.
>>
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>> || \\UTGERS |---------------------*O*---------------------
>> ||_// Biomedical | Ryan Novosielski - Senior Technologist
>> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
>> || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>> `'
>>
>> On Feb 7, 2015, at 12:34, Jonathan Perkins <[email protected]> wrote:
>>
>> Can you set MV2_SHOW_CPU_BINDING equal to 1 when running your job? This
>> should show whether affinity is causing your processes to be
>> oversubscribed on a set of cores.
>>
>> If this is the case you can disable the affinity from the library by
>> setting MV2_ENABLE_AFFINITY to 0.
>>
>> On Fri, Feb 06, 2015 at 03:18:48PM -0800, Novosielski, Ryan wrote:
>> I am running into a similar problem, with Slurm 14.11.3 and MVAPICH2 2.0. I
>> am wondering if perhaps having CPU affinity configured in MVAPICH2 and Slurm
>> at the same time isn't a bad idea (I've also since realized that it uses
>> cgroups and that the 2.6.18 kernel in RHEL5 does not support it anyway --
>> but it didn't seem to be harming anything. Maybe it was?).
>>
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>> || \\UTGERS |---------------------*O*---------------------
>> ||_// Biomedical | Ryan Novosielski - Senior Technologist
>> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
>> || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>> `'
>
> --
> Jonathan Perkins

____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | [email protected] - 973/972.0922 (2x0922)
|| \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
`'
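
For anyone following the thread: the symptom quoted above (both jobs landing on cores 0-3) is the kind of overlap an oversubscription check would flag. Below is a minimal Python sketch of such a check, assuming you already have each job's core list as a string like "0-3". The job labels and function names are hypothetical, and the sketch does not parse the actual MV2_SHOW_CPU_BINDING output, whose format is not shown in this thread.

```python
from collections import defaultdict


def parse_core_list(spec):
    """Expand a core list such as "0-3,8" into a set of core IDs."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def oversubscribed_cores(bindings):
    """Return {core: [job labels]} for every core claimed by more than one job."""
    owners = defaultdict(list)
    for job, spec in bindings.items():
        for core in parse_core_list(spec):
            owners[core].append(job)
    return {
        core: sorted(jobs)
        for core, jobs in sorted(owners.items())
        if len(jobs) > 1
    }


# Hypothetical labels; the core ranges are the ones reported in the thread.
bindings = {"job1": "0-3", "job2": "0-3"}
for core, jobs in oversubscribed_cores(bindings).items():
    print(f"core {core} is shared by: {', '.join(jobs)}")
```

If the two jobs were bound to disjoint ranges (say "0-3" and "4-7"), the function would return an empty dict, which is what you would hope to see once affinity is sorted out.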
