Hi Ryan, do you still have SLURM's TaskAffinity turned off?  If this is
turned on I would expect your binding problem to be resolved since you're
now using srun to launch the jobs.
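
If it helps to check this independently of MVAPICH2, here is a minimal
sketch (Linux only) that prints the CPU mask each task inherits from the
launcher. SLURM_PROCID and SLURM_JOB_ID are the variables srun normally
exports to each task; the script name show_affinity.py and the helper
describe_affinity() are made up for illustration and are not part of
SLURM or MVAPICH2.

```python
import os
import socket


def describe_affinity(pid=0):
    """Return a one-line summary of the CPUs this process may run on."""
    # os.sched_getaffinity is Linux-only; pid 0 means "the calling process".
    cpus = sorted(os.sched_getaffinity(pid))
    # Assumed to be set by srun for each task; "?" when run outside SLURM.
    rank = os.environ.get("SLURM_PROCID", "?")
    job = os.environ.get("SLURM_JOB_ID", "?")
    return f"host={socket.gethostname()} job={job} rank={rank} cpus={cpus}"


if __name__ == "__main__":
    print(describe_affinity())
```

Launching it the same way as the real job (for example
"srun -n 2 python show_affinity.py") from two concurrent jobs should show
whether both jobs are handed the same CPUs before the MPI library applies
any binding of its own.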

On Wed, Feb 11, 2015 at 3:34 PM, Ryan Novosielski <[email protected]>
wrote:

>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> So I've rebuilt MVAPICH2 2.0.1, with the flags for SLURM PMI. Then I
> rebuilt AMBER 14. It solved one problem, which was that every time I
> ran pmemd.cuda.MPI with srun, it would run all of the processes
> against the same GPU card. That no longer happens and it uses
> different cards. However, CPU affinity is still a problem: both jobs
> are bound to the same CPUs. This does not seem to be an issue when
> running jobs via Torque, so I'm not really sure what is happening.
> Could it be that the Torque runs rely on the Torque cpuset to steer
> the ranks onto particular CPUs, and that MVAPICH2 is not really doing
> it?
>
> I'm actually not sure what mailing list this belongs on at this point.
> It does seem as if this works right with Torque and not SLURM, which
> would seem to implicate SLURM. But it seems that MVAPICH2 should be
> making this happen and isn't. I guess if I had some pointers for where
> to look here, I could figure out what's going on.
>
> On 02/10/2015 09:24 PM, Novosielski, Ryan wrote:
> >
> > So it certainly could be related to affinity. Here is the
> > affinity-related output from the two PBS jobs:
> >
> > run001:
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   0
> > RANK:1  CPU_SET:   1
> > -------------------------------------
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   0
> > RANK:1  CPU_SET:   1
> > -------------------------------------
> >
> > run002:
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   4
> > RANK:1  CPU_SET:   5
> > -------------------------------------
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   4
> > RANK:1  CPU_SET:   5
> > -------------------------------------
> >
> > Now the two SLURM jobs:
> >
> > run001:
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   0
> > RANK:1  CPU_SET:   1
> > -------------------------------------
> >
> > run002:
> > -------------CPU AFFINITY-------------
> > RANK:0  CPU_SET:   0
> > RANK:1  CPU_SET:   1
> > -------------------------------------
> >
> > Both jobs are running on an MVAPICH2 that was built without setting
> > SLURM as the PMI. The jobs spawn the processes via MVAPICH2's
> > mpiexec. I tried srun, but it seemed to run all of the jobs on the
> > same GPU, so I was waiting to recompile MVAPICH2 and then AMBER
> > using the SLURM PMI. I guess it's possible that will solve the
> > problem, but this is still peculiar.
> >
> > --
> > ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> > || \\UTGERS      |---------------------*O*---------------------
> > ||_// Biomedical | Ryan Novosielski - Senior Technologist
> > || \\ and Health | [email protected] - 973/972.0922 (2x0922)
> > ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
> >      `'
> > ________________________________________
> > From: Jonathan Perkins [[email protected]]
> > Sent: Tuesday, February 10, 2015 4:42 PM
> > To: slurm-dev
> > Cc: Novosielski, Ryan
> > Subject: [slurm-dev] Amber + MVAPICH2 slower with SLURM vs PBS
> >
> > Do you have both environments available to do this comparison?  If
> > so, is SLURM vs Torque the only difference?
> >
> > I do think it would be good to provide the output of the MPI job
> > with the two variables that I mentioned in the earlier post
> > (MV2_SHOW_CPU_BINDING and MV2_ENABLE_AFFINITY).  Maybe it will show
> > a difference in affinity.  Otherwise something else may be at play.
> >
> > Between your two jobs with SLURM, did you only flip the Task
> > Affinity setting?  It seems that affinity in MVAPICH2 was enabled in
> > both runs, so I would not expect the second run to perform so badly.
> >
> > On Sat, Feb 07, 2015 at 10:48:31AM -0800, Novosielski, Ryan wrote:
> >> So I turned off TaskAffinity (=none) and we ran two CUDA/GPU jobs
> >> on one node. Apparently the performance with PBS/Torque is good
> >> and with Slurm it is not. I'm confused as to why it would make
> >> any difference:
> >>
> >> Running 1 MPI job with Slurm:
> >>   GPU utilization: 74-99%
> >>   CPU: 4 CPU cores (0-3), 64-84% utilization
> >>   Speed: 21 ns/day vs. previously reported 25.6 ns/day with PBS
> >>
> >> Submitting the second MPI job:
> >>   GPU utilization: down to 9-13% (slightly better; it was 1-3%
> >>     before, with TaskAffinity enabled)
> >>   CPU: 4 CPU cores, the same 0-3, 99% utilization
> >>   Speed: slow; it eventually reported 1.45 ns/day
> >>
> >> Would that variable still be helpful to try?
> >>
> >> We're using Slurm 14.11.3, MVAPICH2 2.0, Intel Compiler 15.0.1,
> >> and AMBER 14 for these performance numbers. GPU's are M2090's I
> >> think. I'd have to check that he wasn't using the K20's.
> >>
> >>
> >> On Feb 7, 2015, at 12:34, Jonathan Perkins
> >> <[email protected]> wrote:
> >>
> >>
> >> Can you set MV2_SHOW_CPU_BINDING equal to 1 when running your
> >> job?  This should show whether affinity is causing your processes
> >> to be oversubscribed on a set of cores.
> >>
> >> If this is the case you can disable the affinity from the library
> >> by setting MV2_ENABLE_AFFINITY to 0.
> >>
> >> On Fri, Feb 06, 2015 at 03:18:48PM -0800, Novosielski, Ryan wrote:
> >> I am running into a similar problem, with Slurm 14.11.3 and
> >> MVAPICH2 2.0. I am wondering whether having CPU affinity configured
> >> in MVAPICH2 and Slurm at the same time is a bad idea. (I've also
> >> since realized that it uses cgroups and that the 2.6.18 kernel in
> >> RHEL5 does not support them anyway, but it didn't seem to be
> >> harming anything. Maybe it was?)
> >>
> >
> > -- Jonathan Perkins
> >
>
> - --
> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS      |---------------------*O*---------------------
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>      `'
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iEYEARECAAYFAlTbuiwACgkQmb+gadEcsb4HQwCdG8nMs/Qxt2JqVpyBtqoS1IvQ
> UScAoNpGLd3AhH0Zkyh0J0XRIFwg66FN
> =HGsw
> -----END PGP SIGNATURE-----
>
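
For what it's worth, the binding dumps quoted above can be compared
mechanically. The sketch below assumes the "RANK:n  CPU_SET:  m" line
format shown in the thread, with a single CPU id per rank; other
CPU_SET formats are not handled, and the helper names are made up for
illustration.

```python
import re

# Matches lines such as "RANK:0  CPU_SET:   4" (one CPU id per rank assumed).
BINDING = re.compile(r"RANK:\s*(\d+)\s+CPU_SET:\s*(\d+)")


def cpus_used(output):
    """Return the set of CPU ids named in one job's binding dump."""
    return {int(cpu) for _rank, cpu in BINDING.findall(output)}


def shared_cpus(job_a, job_b):
    """Return the sorted CPU ids that both jobs are bound to."""
    return sorted(cpus_used(job_a) & cpus_used(job_b))


# Binding values copied from the runs quoted in this thread.
slurm_run001 = "RANK:0  CPU_SET:   0\nRANK:1  CPU_SET:   1\n"
slurm_run002 = "RANK:0  CPU_SET:   0\nRANK:1  CPU_SET:   1\n"
pbs_run001 = "RANK:0  CPU_SET:   0\nRANK:1  CPU_SET:   1\n"
pbs_run002 = "RANK:0  CPU_SET:   4\nRANK:1  CPU_SET:   5\n"

print("SLURM overlap:", shared_cpus(slurm_run001, slurm_run002))  # [0, 1]
print("PBS overlap:", shared_cpus(pbs_run001, pbs_run002))        # []
```

A non-empty overlap for two jobs on the same node means their ranks are
competing for the same cores, which matches the SLURM runs above.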



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
