Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-03-04 Thread Dave Love
Tru Huynh writes:
> afaik, 2.6.32-431 series is from RHEL (and clones) version >= 6.5
[Right.]
> otoh, it might be related to http://bugs.centos.org/view.php?id=6949
That looks likely. As we bind to cores, we wouldn't see it for MPI processes, at least, and will see higher

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-03-04 Thread Dave Love
Bernd Dammann writes:
> We use Moab/Torque, so we could use cpusets (but that has had some other side effects earlier, so we did not implement it in our setup).
I don't remember what Torque does, but core binding and (Linux) cpusets are somewhat orthogonal. While a cpuset
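To illustrate that orthogonality, here is a minimal sketch of confining a job with a raw cgroup-v1 cpuset (assuming the hierarchy is mounted at /sys/fs/cgroup/cpuset; the directory name job42 is made up, and some older kernels name the files cpus/mems without the cpuset. prefix):

    mkdir /sys/fs/cgroup/cpuset/job42
    echo 0-7 > /sys/fs/cgroup/cpuset/job42/cpuset.cpus  # job may run on cores 0-7
    echo 0 > /sys/fs/cgroup/cpuset/job42/cpuset.mems    # and allocate from NUMA node 0
    echo $$ > /sys/fs/cgroup/cpuset/job42/tasks         # move this shell (and children) in

Binding remains a separate decision: within that set, mpirun's binding options can still pin each rank to one specific core, or leave ranks free to migrate among the eight.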

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-03-04 Thread Bernd Dammann
On 3/2/14 0:44 AM, Tru Huynh wrote:
> On Fri, Feb 28, 2014 at 08:49:45AM +0100, Bernd Dammann wrote:
>> Maybe I should say that we moved from SL 6.1 and OMPI 1.4.x to SL 6.4 with the above kernel, and OMPI 1.6.5 - which means a major upgrade of our cluster. After the upgrade, users reported those

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-28 Thread Bernd Dammann
On 2/27/14 14:06, Noam Bernstein wrote:
> On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote:
>> Bernd Dammann wrote:
>>> Using the workaround '--bind-to-core' only makes sense for those jobs that allocate full nodes, but the majority of our jobs don't do

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-28 Thread Bernd Dammann
On 2/27/14 16:47, Dave Love wrote:
> Bernd Dammann writes:
>> Hi, I found this thread from before Christmas, and I wondered what the status of this problem is. We have experienced the same problems since our upgrade to Scientific Linux 6.4, kernel 2.6.32-431.1.2.el6.x86_64, and

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread John Hearns
Noam, cpusets are a very good idea - not only for CPU binding, but for isolating badly behaved applications. If an application starts using huge amounts of memory, kill it, collapse the cpuset, and it is gone - a nice, clean way to manage jobs.
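A sketch of the kill-and-collapse workflow John describes, under the same assumptions as the cpuset example above (cgroup-v1 hierarchy at /sys/fs/cgroup/cpuset, hypothetical job directory job42):

    # signal everything still inside the cpuset, then remove the now-empty directory
    while read pid; do kill -9 "$pid"; done < /sys/fs/cgroup/cpuset/job42/tasks
    rmdir /sys/fs/cgroup/cpuset/job42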

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Ralph Castain
On Feb 27, 2014, at 5:06 AM, Noam Bernstein wrote:
> On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote:
>> Bernd Dammann wrote:
>>> Using the workaround '--bind-to-core' only makes sense for those jobs that

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Noam Bernstein
On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote:
> Bernd Dammann wrote:
>> Using the workaround '--bind-to-core' only makes sense for those jobs that allocate full nodes, but the majority of our jobs don't do that.
> Why? We still use this option

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Patrick Begou
Bernd Dammann wrote:
> Using the workaround '--bind-to-core' only makes sense for those jobs that allocate full nodes, but the majority of our jobs don't do that.
Why? We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenFOAM and other applications to attach each process to its core
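For reference, the flag as typically used on the 1.6 series, and the two-word spelling that 1.7.x switched to (the solver name and rank count are purely illustrative):

    mpirun --bind-to-core -np 16 ./my_solver   # OpenMPI 1.6.x
    mpirun --bind-to core -np 16 ./my_solver   # OpenMPI 1.7.x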

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-26 Thread Bernd Dammann
Hi, I found this thread from before Christmas, and I wondered what the status of this problem is. We have experienced the same problems since our upgrade to Scientific Linux 6.4, kernel 2.6.32-431.1.2.el6.x86_64, and OpenMPI 1.6.5. Users have reported severe slowdowns in all kinds of

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Noam Bernstein
On Dec 18, 2013, at 5:19 PM, Martin Siegert wrote:
> Thanks for figuring this out. Does this work for 1.6.x as well? The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity covers versions 1.2.x to 1.5.x. Does 1.6.x support mpi_paffinity_alone = 1? I set
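On the 1.6 series the setting can be passed like any other MCA parameter; a sketch, with a made-up executable name and rank count:

    mpirun --mca mpi_paffinity_alone 1 -np 16 ./app
    # or system-wide in $prefix/etc/openmpi-mca-params.conf:
    #   mpi_paffinity_alone = 1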

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Dave Love
Brice Goglin writes:
> hwloc-ps (and lstopo --top) are better at showing process binding but they lack a nice pseudographical interface with dynamic refresh.
That seems like an advantage when you want to check on a cluster!
> htop uses hwloc internally iirc, so there's

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Dave Love
Noam Bernstein writes:
> On Dec 18, 2013, at 10:32 AM, Dave Love wrote:
>> Noam Bernstein writes:
>>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some collective

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Martin Siegert
Hi, expanding on Noam's problem a bit ...
On Wed, Dec 18, 2013 at 10:19:25AM -0500, Noam Bernstein wrote:
> Thanks to all who answered my question. The culprit was an interaction between 1.7.3 not supporting mpi_paffinity_alone (which we were using previously) and the new kernel.

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Ake Sandgren
On Wed, 2013-12-18 at 11:47 -0500, Noam Bernstein wrote:
> Yes - I never characterized it fully, but we attached with gdb to every single running vasp process, and all were stuck in the same call to MPI_Allreduce() every time. It's only happening on rather large jobs, so it's not the
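A sketch of that attach-and-backtrace check, assuming gdb is installed on the compute nodes and the processes match the name vasp:

    # grab a backtrace from every running vasp process, non-interactively
    for pid in $(pgrep vasp); do
        echo "=== pid $pid ==="
        gdb -batch -p "$pid" -ex bt 2>/dev/null
    done

If every rank shows the same frame inside MPI_Allreduce, the job is likely deadlocked in that collective rather than merely slow.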

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Brice Goglin
hwloc-ps (and lstopo --top) are better at showing process binding, but they lack a nice pseudographical interface with dynamic refresh. htop uses hwloc internally IIRC, so there's hope we'll have everything needed in htop one day ;) Brice
Dave Love wrote:
> John
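The commands in question, for anyone following along (by default hwloc-ps lists only processes that are actually bound; -a shows all of them):

    hwloc-ps            # processes and the cpuset each one is bound to
    lstopo --top        # topology diagram with running tasks overlaid
    watch -n1 hwloc-ps  # a crude stand-in for dynamic refresh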

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Dave Love
John Hearns writes:
> 'Htop' is a very good tool for looking at where processes are running.
I'd have thought hwloc-ps is the tool for that.

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Dave Love
Noam Bernstein writes:
> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some collective communication), but now I'm wondering whether I should just test 1.6.5.
What bug, exactly? As you mentioned vasp, is it specifically affecting

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Noam Bernstein
Thanks to all who answered my question. The culprit was an interaction between 1.7.3 not supporting mpi_paffinity_alone (which we were using previously) and the new kernel. Switching to --bind-to core (actually the environment variable OMPI_MCA_hwloc_base_binding_policy=core) fixed the
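Spelled out, the two equivalent forms under 1.7.x (rank count and executable name are illustrative):

    export OMPI_MCA_hwloc_base_binding_policy=core
    mpirun -np 16 ./app
    # same effect as:
    mpirun --bind-to core -np 16 ./app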

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Maxime Boissonneault
Hi, do you have thread multiple (MPI_THREAD_MULTIPLE) enabled in your OpenMPI installation? Maxime Boissonneault
On 2013-12-16 17:40, Noam Bernstein wrote:
> Has anyone tried to use openmpi 1.7.3 with the latest CentOS kernel (well, nearly latest: 2.6.32-431.el6.x86_64), and especially with infiniband? I'm
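One way to check what a given build supports (the exact wording of the output line varies between versions):

    ompi_info | grep -i thread
    # e.g.:  Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)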

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Ralph Castain
OMPI_MCA_hwloc_base_binding_policy=core
On Dec 17, 2013, at 8:40 AM, Noam Bernstein wrote:
> On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote:
>> Are you binding the procs? We don't bind by default (this will change in 1.7.4), and

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote:
> Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across kernels.
> add "--bind-to-core" to your cmd line
Now that it works, is

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote:
> Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across kernels.
> add "--bind-to-core" to your cmd line
Yeay - it works. Thank

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Nathan Hjelm
On Tue, Dec 17, 2013 at 11:16:48AM -0500, Noam Bernstein wrote:
> On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote:
>> Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote:
> Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across kernels.
> add "--bind-to-core" to your cmd line
I've previously always

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread John Hearns
'Htop' is a very good tool for looking at where processes are running.

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Ralph Castain
Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across kernels. Add "--bind-to-core" to your cmd line.
On Dec 17, 2013, at 7:09 AM, Noam Bernstein wrote:
> On Dec 16, 2013, at

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 16, 2013, at 5:40 PM, Noam Bernstein wrote:
> Once I have some more detailed information I'll follow up.
OK - I've tried to characterize the behavior with vasp, which accounts for most of our cluster usage, and it's quite odd. I ran my favorite