Hello Tim and list
Having been listening to and participating in this interesting thread from
the very beginning,
I would like to add another two cents.
I don't oppose moving the specific discussion of the Open MPI memcpy
implementation to the developers' list.
I am not a developer (or qualified to be one) and I don't subscribe to
the developers' list,
but I would subscribe just to follow this thread.
However, I have the feeling that the general question originally posed
by Daniel Mantione
(why his programs ran faster under Infiniband than under shared memory)
has an interest of its own which goes beyond the developers community.
Just like Daniel and many others, I have seen disappointing
performance of MPI code on multicore machines,
in code that scales fine in networked environments and on single-core CPUs,
particularly in memory-intensive programs.
The bad performance has been variously ascribed to memory bandwidth /
contention,
to setting processor and memory affinity versus letting the kernel
scheduler do its thing,
to poor performance of memcpy, and so on.
All these reasons are interconnected, but it is hard for a simple MPI
user to nail down where the
difficulty really resides, and even harder to fix or attenuate the problem.
Hence, the discussion is very useful for mere users like me.
I've seen the discussion of this same issue popping up on several
mailing lists
(Beowulf, Rocks Clusters, MPICH, MITgcm, etc.).
On the MPICH list, the same inefficiency of memcpy was named as a
possible culprit.
There were suggestions on the MPICH list that simply using Intel icc
instead of gcc
to compile the MPI library would improve the situation (due to
different implementations of memcpy).
It would be great if the Open MPI developers could shed some light on
this general issue,
and perhaps continue here on the users' list the general part of this
discussion,
which is in essence how to use Open MPI efficiently in a shared-memory
multi-core environment.
Many thanks,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Tim Mattox wrote:
Hi Terry (and others),
I have previously explored this some on Linux/X86-64 and concluded that
Open MPI needs to supply its own memcpy routine to get good sm performance,
since the memcpy supplied by glibc is not even close to optimal.
We have an unused MCA framework already set up to supply an opal_memcpy.
AFAIK, George and Brian did the original work to set up that framework.
It has been on my to-do list for awhile to start implementing
opal_memcpy components
for the architectures I have access to, and to modify OMPI to actually
use opal_memcpy
where it makes sense. Terry, I presume what you suggest could
be dealt with similarly when we are running/building on SPARC.
Any followup discussion on this should probably happen on the
developer mailing list.
On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje <terry.don...@sun.com> wrote:
Interestingly enough, on the SPARC platform the Solaris memcpy actually uses
non-temporal stores for copies >= 64KB. By default some of the mca
parameters to the sm BTL stop at 32KB. I've experimented with bumping
the sm segment sizes above 64K and seen incredible speedups on our
M9000 platforms. I am looking for a clean way to integrate into Open MPI a
memcpy that lowers this boundary to 32KB or less.
I have not looked into whether the Solaris x86/x64 memcpy uses non-temporal
stores or not.
--td
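For readers wondering what a threshold-based copy like the Solaris one looks
like in practice, here is a hedged x86 sketch using SSE2 streaming stores,
which bypass the cache the way non-temporal stores do. This is not Open MPI
or Solaris code; the 64 KB threshold and all function names are assumptions
for illustration, and it only builds on SSE2-capable x86 hardware:

```c
/* Sketch of a threshold-based copy: plain memcpy below NT_THRESHOLD,
 * SSE2 non-temporal (streaming) stores at or above it. */
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD (64 * 1024)   /* 64 KB, as in the Solaris memcpy */

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;
    /* Copy 16-byte chunks with streaming stores that bypass the cache;
     * this path requires a 16-byte-aligned destination. */
    while (len >= 16 && ((uintptr_t)d % 16) == 0) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16; s += 16; len -= 16;
    }
    memcpy(d, s, len);   /* unaligned or leftover bytes use plain memcpy */
    _mm_sfence();        /* make the streaming stores globally visible */
}

static void smart_copy(void *dst, const void *src, size_t len)
{
    if (len >= NT_THRESHOLD)
        copy_nontemporal(dst, src, len);
    else
        memcpy(dst, src, len);
}
```

The win Ron mentions comes precisely from the cache bypass: for large
transfers the destination lines never displace useful data from the cache,
which helps bandwidth benchmarks whose receivers never touch the buffer.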
Message: 1
Date: Thu, 14 Aug 2008 09:28:59 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] SM btl slows down bandwidth?
To: rbbr...@sandia.gov, Open MPI Users <us...@open-mpi.org>
Message-ID: <562557eb-857c-4ca8-97ad-f294c7fed...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
At this time, we are not using non-temporal stores for shared memory
operations.
On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
[...]
MPICH2 manages to get about 5GB/s in shared memory performance on the
Xeon 5420 system.
Does the sm btl use a memcpy with non-temporal stores like MPICH2?
This can be a big win for bandwidth benchmarks that don't actually
touch their receive buffers at all...
-Ron
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
-- Jeff Squyres Cisco Systems